An IntroBDUCtion
================
by Harry Mangalam v1.62 - Sept 8th, 2012
:icons:

//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]

// this file is converted to the HTML via the command:
// fileroot="/home/hjm/nacs/BDUC_USER_HOWTO"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html/bduc
// update svn from BDUC
// scp ${fileroot}.txt hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to BDUC_USER_HOWTO"'
// and push it to Wordpress:
// blogpost.py update -c HowTos ${fileroot}.txt
// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html

[[beforeyoustart]]
Please read this
----------------

- BDUC is a shared facility, run on almost no budget, by a few part-time contributors.

- BDUC is 'NOT' your personal machine. It's shared by about 400 users, of whom 20-50 may be using it at any one time. (Once connected, type 'w' into the terminal to see who's on the machine at the same time as you.) Actions you take on BDUC affect all other users.

- BDUC has finite resources and bandwidth. It's only via the communal use of the SGE scheduler that it remains a usable resource. It uses 1Gigabit ethernet to connect the nodes, which can support about 100MB/s per connection. That sounds like a lot, but not when it's being shared by 50 others, and especially not when 15 of those others are all trying to copy 20GB files back and forth (see below).

If you are unfamiliar with the idea of a cluster, please read link:#clustercomputing[this brief description of cluster computing].

[[whatswrong]]
How to let us know what's wrong
-------------------------------

Since BDUC is a research cluster, it is in perpetual flux as apps, libraries, and modules are added, updated, or modified, so sometimes a bug will creep in where none existed before. When you find something missing or a behavior that seems odd, please let us know. You can mailto:bduc-request@uci.edu?Subject=User%20Comment[email the BDUC admins here].

Note that it will help considerably if you tell us more than 'It doesn't work' or 'I can't log in'. If you want quick resolution of the problem, please send us as much relevant info as possible, including a description of 'what triggered the misbehavior'. For us to help you, we have to be able to re-create the problem, so include the commandline you used, including all the options, the input and output file paths, and preferably the command prompt, which should include the node from which it was issued, and the time. If the misbehavior involves an error message, doing a Google search on that error message *verbatim* will often produce the answer (see http://moo.nac.uci.edu/~hjm/FixITYourselfWithGoogle.html[Fix IT yourself with Google]).

While many of you are not programmers, you're dealing with programs, and if we are to have any hope of debugging the process that caused the failure, the more info the better (usually). PLEASE READ http://www.chiark.greenend.org.uk/~sgtatham/bugs.html[How to Report Bugs Effectively] before you report a failure. At least glance at it. If you're going to spend a lot of time with computers, you should also read Eric Raymond's encyclopedic http://www.catb.org/~esr/faqs/smart-questions.html[How To Ask Questions The Smart Way]. It will be of use throughout your life.
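To make this concrete, a report that includes something like the following (a hypothetical sketch - the node, the command, and the filename are made up) gives us enough to start re-creating the problem:

-----------------------------------------------------------------
hmangala@a64-001:~ $ date
Mon Sep 10 10:32:01 PDT 2012
hmangala@a64-001:~ $ R CMD BATCH my_analysis.R   # the exact commandline that failed
# ... followed by the *verbatim* error message that it produced
-----------------------------------------------------------------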
If your problem seems to be about a missing library, such as:

----------------------------------------------------------------------------
unable to load shared object '/apps/R/2.14.0/lib64/R/modules/libfunky.so.2':/
libfrenemy.so.3: cannot open shared object file: No such file or directory
----------------------------------------------------------------------------

please see the link:#missinglibs[Resolving missing libraries] section in the Appendix.

[[whatis]]
What is a BDUC?
---------------

The 'Broadcom Distributed Unified Cluster' (BDUC) is, as the name suggests, a distributed group of clusters unified by running under a single http://moo.nac.uci.edu/~hjm/Sun_Grid_Engine_62_install_and_config.pdf[Sun Grid Engine] (SGE) Resource Manager. BDUC consists of subclusters of 2- to 48-core AMD64 Opteron nodes running 64bit Linux. One group of 80 CentOS nodes and 35 Debian-based nodes is in the NACS Academic Data Center, and another group of 40 nodes is in the ICS data center, for a total of ~155 nodes / ~600 cores. There is another, smaller subcluster (see BEAR, below) running Kubuntu. The nodes are interconnected with 1Gb ethernet and have the MPICH, MPICH2, and OpenMPI (preferred) environments for parallel jobs.

All the nodes share a common */home* which is on a RAID6 system but which is NOT backed up. If you generate valuable data, you should move it off ASAP. *To be perfectly clear: if the BDUC storage system crashes, you will irretrievably lose all your data.*

[[bear]]
What is a BEAR?
---------------

The 'Broadcom EA Replacement' (BEAR) is a Broadcom-supplied subcluster consisting of 8 larger nodes and some condo nodes administered especially for interactive use. These nodes each have 4-64 Opteron cores and 32-256GB RAM. Most run the 64bit Kubuntu (10.04.3) Desktop Edition, so you can have access to the full graphical KDE desktop via VNC or nx, as well as the individual GUI applications and shell utilities. BEAR is fully integrated with BDUC and shares its */home* directories, but has a different, larger set of applications. One of the nodes (claw1) is half-reserved for interactive use; the others can be used for both interactive and batch runs (currently limited to 48hrs) on the *claws* Q. The claw6 node is usually reserved for running the commercial CLCBio application. mailto:harry.mangalam@uci.edu[Contact me] if you want access to that system.

'You can compile and run jobs on all the claw nodes, but don't saturate claw1 with multiple serial or parallel jobs.'

Condo Nodes
-----------

BDUC supports the use of 'condo nodes' - privately purchased, but grafted onto the BDUC infrastructure to take advantage of the shared applications and administration. These nodes are typically configured to allow public jobs to run on them when their owners are not using them. For example, the a64-182, a64-183, claw9 and claw10 nodes are privately owned but available for public jobs. If the owners want to reclaim all the cores for a heavy analysis job, public jobs running on those nodes may be suspended.

How do I get an account?
------------------------

You request an account by sending a message *including your UCINetID* to mailto:bduc-request@uci.edu[]. Please let us know in that message if you want to use the SGE batch system to submit long-running or multiple jobs. You should get an acknowledgement within a few hours, and your account should be available then.
By default, BDUC & BEAR are open to all postgrad UCI researchers, although they are also available to undergrads with faculty sponsorship. There is no cost to use BDUC, but neither is there any right to use it. Your account may be terminated if we observe activity that runs counter to good cluster citizenship.

[[connect]]
How do I connect to BDUC?
~~~~~~~~~~~~~~~~~~~~~~~~~

You 'must' use http://en.wikipedia.org/wiki/Secure_Shell[ssh], an encrypted terminal protocol. Be sure to use the '-Y' or '-X' options if you want to view X11 graphics (link:#graphics[see below]).

*On a Mac*, use the 'Applications -> Utilities -> Terminal' app. +
*On a WinPC*, use the excellent http://www.chiark.greenend.org.uk/~sgtatham/putty/[putty]. See also link:#XonWin[below]. +
*On Linux*, I assume that you know how to start a Terminal session with one of the bazillion terminal apps (http://konsole.kde.org/[konsole] & http://software.jessies.org/terminator/[terminator] are 2 good ones). +

http://en.wikipedia.org/wiki/Telnet[Telnet] access is NOT available.

Use your UCINetID and associated password to log into the login node (bduc-login.nacs.uci.edu) via *ssh*. To connect using a Mac or Linux, open the Terminal app and type:

-----------------------------------------------------------------------------
ssh -Y UCINetID@bduc-login.nacs.uci.edu
# the '-Y' requests that the X11 protocol be tunneled back to you inside of ssh.
-----------------------------------------------------------------------------

As of June 15th, 2009, you can also ssh directly to the claw1 node for a 64bit interactive node from anywhere on campus.

-----------------------------------------------------------------------------
ssh -Y UCINetID@bduc-claw1.nacs.uci.edu
-----------------------------------------------------------------------------

[[passwordless_ssh]]
How to set up passwordless ssh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.Passwordless ssh setup is now automatic
[NOTE]
==================================================================================
From *Nov. 15th, 2009* onwards, this is set up for you automatically when your account is activated, so you no longer have to do this manually. However, as a reference for those of you who want to set it up on other machines, I've moved the documentation to the link:#HowtoPasswordlessSsh[Appendix]. The automatic setup also includes setting the '~/.ssh/config' file to prevent the "first time ssh challenge problem".

If you are a Mac or Linux user, you may also be interested in using ssh to execute commands on remote machines. This is http://moo.nac.uci.edu/~hjm/SSHoutingWithSsh.html[described here.]
==================================================================================

Note that in order to help you debug login and other problems, the sysadmin's public ssh keys are also added to your '~/.ssh/authorized_keys' file. If you do not want this, you're welcome to comment it out, but unless it's active, I can't help you with problems that require a direct login.

[[ssherrors]]
ssh errors
~~~~~~~~~~

Occasionally you may get the error below when you try to log into BDUC (or more rarely, among the BDUC nodes):

-----------------------------------------------------------------------
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!        @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
93:c1:d0:97:e8:a0:f5:91:13:89:7d:94:6c:aa:9b:8c.
Please contact your system administrator.
Add correct host key in /Users/joeuser/.ssh/known_hosts to get rid of this message.
Offending key in /Users/joeuser/.ssh/known_hosts:2
RSA host key for bduc.nacs.uci.edu has changed and you have requested strict checking.
Host key verification failed.
-----------------------------------------------------------------------

The reason for this error is that the computer to which you're connecting has changed its identification key. This might be due to the mentioned 'man-in-the-middle' attack, but it is far more likely to be an administrative change that has caused the BDUC node's ID to change. This may be due to a change in hardware, reconfiguration of the node, a reboot, an upgrade, etc. The fix is buried in the error message itself:

-----------------------------------------------------------------------
Offending key in /Users/joeuser/.ssh/known_hosts:2
-----------------------------------------------------------------------

Simply edit that file and delete the line referenced. When you log in again, there will be a notification that the key has been added to your 'known_hosts' file.

Should you want to be able to log in regardless of this warning, you'll have to edit the '/etc/ssh/ssh_config' file (Mac, Linux) and add the 2 lines shown below. There are http://goo.gl/rCeE[good reasons for not doing this], but it's a convenience that many of us use. Consider it the 'rolling stop' of ssh security.

-----------------------------------------------------------------------
Host *
    StrictHostKeyChecking ask
-----------------------------------------------------------------------

After you do that, you'll still get the warning (which you should investigate) but you'll be able to log in. If you're using http://www.chiark.greenend.org.uk/~sgtatham/putty/[putty] on Windows, you won't be able to effect this security skip-around. http://goo.gl/rCeE[Read why here].

After you log in...
~~~~~~~~~~~~~~~~~~~

Logging in to *bduc.nacs.uci.edu* will give you access to a Linux shell (http://www.gnu.org/software/bash/[bash] by default; http://www.tcsh.org/Home[tcsh] and ksh are available).

.Some bash pointers
[NOTE]
===========================================================================
The default shell (or environment in which you type commands) for your BDUC login is bash. It looks like the Windows CMD shell, but it is MUCH more powerful. There's a good exposition of some of the things you can do with the shell http://www.catonmat.net/blog/the-definitive-guide-to-bash-command-line-history/[here] and a http://www.catonmat.net/blog/wp-content/plugins/wp-downloadMonitor/user_uploads/bash-history-cheat-sheet.pdf[good cheatsheet here]. If you're going to spend some time working on BDUC, it's worth your while to learn some of the more advanced commands and tricks.

If you're going to be using BDUC more than a few times, it's useful to set up a file of aliases for frequently used commands and then 'source' that file from your '~/.bashrc'. ie:

---------------------------------------------------------------------------
# the ~/.aliases file contains shortcuts for frequently used commands
# your ~/.bashrc file should source that file: '. ~/.aliases'
alias dabrick="ssh -Y dabrick"
alias asrch="apt-cache search"
alias ashow="apt-cache show"
alias hg="history|grep "       # search history for this regex
alias pa="ps aux |grep "       # search processes for this regex
alias nu="ls -lt | head -11"   # what are the 11 newest files?
alias big="ls -lhS | head -20" # what are the 20 biggest files?
# and even some more complicated commands
alias edaccheck='cd /sys/devices/system/edac/mc && grep [0-9]* mc*/csrow*/[cu]e_count'
---------------------------------------------------------------------------

You can also customize your bash prompt to produce more info than the default 'user@host'. While you're waiting for your calculations to finish, check out the definitive http://tldp.org/HOWTO/Bash-Prompt-HOWTO[bash prompt HOWTO] and / or use http://bashish.sourceforge.net/[bashish] to customize your bash environment.

http://www.dirb.info[DirB] is a set of bash functions that make it very easy to set bookmarks and skip back and forth among them. Download the file from the URL above, 'source' it early in your '.bashrc', and then read how to use it via http://moo.nac.uci.edu/~hjm/DirB.pdf[this link]. It's very simple and very effective. Very briefly: 's bookmark' to set a bookmark, 'g bookmark' to cd to a bookmark, 'sl' to list bookmarks. Recommended if you have deep dir trees and need to keep hopping among the leaves.
===========================================================================

// proof'ed to here.

You will also have access to the resources of BDUC via the SGE commands. The most frequently used SGE commands will be 'qrsh' to request an interactive node and 'qsub' to submit a batch job. You can also check the status of various resources with the 'qconf' command. See the http://gridengine.info/files/SGE_Cheat_Sheet.pdf[SGE cheatsheet] for more detail.

The login node should be considered your 1st stop in doing real work. You can copy files to and from your home directory from the login node, but you shouldn't run any long (>10m) jobs on the login node. If you do and we notice, we'll kill them off. To do real work, request a node from the interactive queue, like this:

-----------------------------------------------------------------
# for a 64bit interactive node
hmangala@bduc-login:~ $ qrsh -q int
# wait a few seconds...
hmangala@a64-001:~

# or you can ssh directly to one of the claw nodes:
ssh claw1
# or
ssh -Y claw1   # with X11 forwarding set up
-----------------------------------------------------------------

[[datastorageonbduc]]
Data Storage on BDUC
--------------------

No limits, but no warnings either
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have not yet imposed disk quotas on BDUC. We encourage you to use the data storage you need, up to hundreds of GB, but we also warn you that if we detect large directories that have not been used in weeks, we retain the right to clean them out. The larger the dataset, the more scrutiny it will get. IF YOU HAVE LARGE DATASETS AND ARE NOT USING THEM, THEY MAY DISAPPEAR WITHOUT WARNING. We mean it when we say that if you generate valuable data, it is up to you to back it up elsewhere ASAP.

If you have no idea how large your data is and how it is distributed, you can find out via the 'du' command (disk usage).
-----------------------------------------------------------------
$ cd /home
$ du -sh hmangala   # you would substitute *your* home dir
5.3G    hmangala/
-----------------------------------------------------------------

To see the distribution of files graphically:

-----------------------------------------------------------------
$ cd; ssh -Y claw1 'kdirstat'
-----------------------------------------------------------------

This will launch http://kdirstat.sourceforge.net/[kdirstat], which will determine the size, type, and age of your files and present them in a color-coded map by size. You can then inspect and hopefully remove the ones least needed.

[[filestoandfrom]]
How do I get my files to and from BDUC?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is covered in more detail in the document http://moo.nac.uci.edu/\~hjm/HOWTO_move_data.html[HOWTO_move_data]. There are currently a few ways to get your files to and from BDUC. The most direct, most available way is via http://en.wikipedia.org/wiki/Secure_copy[scp]. Besides the commandline *scp* utility bundled with all Linux and Mac machines, there are GUI clients for MacOSX and Windows, and of course, Linux. If you have large collections of files or large individual files that change only partially, you might be interested in using http://moo.nac.uci.edu/%7ehjm/HOWTO_move_data.html#rsync[rsync] as well.

Once you copy your data to your BDUC HOME directory, it is available to all the compute nodes via the same mount point on each, so if you need to refer to it in an SGE script, you can reference the same file in the same way on all nodes. ie: '/home/hmangala/my/file' will be the same file on all nodes.

.Make sure bash knows if this is an interactive login
[NOTE]
==================================================================================
If you have customized your '.bashrc' to spit out some useful data when you log in (such as the number of jobs you have running), make sure to wrap that command in a test for an interactive shell. Otherwise, when you try to 'scp' or 'sftp' or 'rsync' data to your BDUC account, your shell will unexpectedly vomit up the same text into the connecting program, with unpleasant results. Wrap those commands with something like this in your '.bashrc':

----------------------------------------------------------
interactive=`echo $- | grep -c i `
if [ ${interactive} = 1 ] ; then
    # put all your interactive stuff in here:
    # ie tell me what my 22 newest files are
    ls -lt | head -22
fi
----------------------------------------------------------
==================================================================================

Windows
^^^^^^^

The hands-down, no-question-about-it, go-to utility here is the free http://www.winscp.net[WinSCP], which gives you a graphical interface for SCP, SFTP and FTP.

.Line endings in files from Windows vs Linux/Unix/MacOSX
**************************************************************
If you are creating data on Windows and saving it as 'plain text' for use on Linux, many Windows applications will save the data with DOS 'end-of-line' characters (a carriage return plus a line feed, aka 'CRLF') as opposed to the Linux/MacOSX newline (a line feed alone, aka 'LF'). This may cause problems on Linux, as some applications will detect and automatically correct Windows newlines but others will not. Ditto visual editors, which you might think would give you an indication of this. Most editors will give you a choice as to which newline you want when you save the file, but in some the choice is not obvious.
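If you're not sure which line endings a file has, the 'file' utility will usually tell you (a minimal sketch; 'mydata.txt' is a hypothetical filename):

----------------------------------------------------------
$ file mydata.txt
mydata.txt: ASCII text, with CRLF line terminators   <- saved with DOS line endings
----------------------------------------------------------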
In any case, unless you're sure of how your data is formatted, you can pass it through the Linux utility 'dos2unix', which will replace the Windows newlines with Linux newlines:

 $ dos2unix windows.file linux.file

http://en.wikipedia.org/wiki/Newline[Read the whole sordid history of the newline here]
**************************************************************

MacOSX
^^^^^^

There may be others, but it looks like the winner here is the oddly named but freely available http://cyberduck.ch/[Cyberduck], which provides graphical file browsing via FTP, SCP/SFTP, WebDAV, and even Amazon S3(!).

Linux
^^^^^

The full range of high-speed net data commandline utilities is available via the above-referenced http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html[HOWTO_move_data]; however, for ease of use, it may well be easiest to use the built-in capabilities of KDE's Swiss Army knife browser http://www.konqueror.org[Konqueror] or the twin-panel file manager http://www.krusader.org/[Krusader], which both support the secure file browser http://www.linux.com/feature/124686[kio-plugin] called http://isdepartment.wordpress.com/2007/04/04/introduction-to-the-kdes-kio-slaves-using-fish[fish]. If you use a *fish URL*, you can connect to the server via shared keys or via password:

-----------------------------------------------------------------
fish://hmangala@bduc.nacs.uci.edu
-----------------------------------------------------------------

[[archivemount]]
archivemount
~~~~~~~~~~~~

Once you've generated some data on BDUC, you may want to keep it handy for a short time while you're further processing it. In order to keep it both compact and accessible, BDUC supports the 'archivemount' utility on both the 'login' and 'claw1' nodes. This allows you to mount a compressed archive (tar.gz, tar.bz2, and zip archives) on a mountpoint as a http://en.wikipedia.org/wiki/Filesystem_in_Userspace[fuse filesystem]. You can 'cd' into the archive, modify files in place, copy files out of the archive, or copy files into the archive. When you unmount the archive, the changes are saved into the archive. Here's an http://www.linux-mag.com/id/7825[extended article on it from Linux Mag].

Here's an example of how to use 'archivemount' with an 84MB zip archive ('jksrc.zip') that you want to interact with.

-----------------------------------------------------------------
# how big is this thang?
$ ls -lh
total 84M
-rw-r--r-- 1 hmangala hmangala 84M Jun 15 14:55 jksrc.zip

# OK - 84MB, which is fine. Now let's make a mount point for it.
$ mkdir jk
$ ls
jk/  jksrc.zip

# so now we have a zipfile and a mountpoint. That's all we need to archivemount
# let's time it just to see how long it takes to unpack and mount this archive:
$ time archivemount jksrc.zip jk

real    0m0.810s   <- less than a second wall clock time
user    0m0.682s
sys     0m0.112s

$ cd jk   # cd into the top of the file tree.

# lets see what the top of this file tree looks like. All file utils can work on this data structure
$ tree |head -11
.
`-- kent
    |-- build
    |   |-- build.crontab
    |   |-- dosEolnCheck
    |   |-- kentBuild
    |   |-- kentGetNBuild
    |   `-- makeErrFilter
    |-- java
    |   |-- build
    |   |-- build.xml

# and the bottom of the file tree.
$ tree |tail
    |   |-- wabaCrude.h
    |   `-- wabaCrude.sql
    |-- xaShow
    |   |-- makefile
    |   `-- xaShow.c
    `-- xenWorm
        |-- makefile
        `-- xenWorm.c

2286 directories, 12793 files   <- lots of files that don't take up any more 'real' space on the disk.

# how does it show up with 'df'? See the last line..
$ df
Filesystem             1K-blocks        Used  Available Use% Mounted on
/dev/md2               373484336    11607976  342598364   4% /
/dev/md1                 1019144       47180     919356   5% /boot
tmpfs                    8254876           0    8254876   0% /dev/shm
/dev/sdc             12695180544  6467766252 6227414292  51% /data
bduc-sched.nacs.uci.edu:/share/sge62
                        66946520     8335072   55155872  14% /sge62
fuse                  1048576000           0 1048576000   0% /home/hmangala/build/fs/jk

# finally, !!IMPORTANTLY!! un-mount it.
$ cd ..              # cd out of the tree
$ fusermount -u jk   # unmount it with 'fusermount -u'
-----------------------------------------------------------------

.Don't make huge archives if you're going to use archivemount
[NOTE]
==================================================================================
'archivemount' has to "unpack" the archive before it mounts it, so trying to 'archivemount' an enormous archive will be slow and frustrating. If you're planning on using this approach, please restrict the size of your archives to ~100MB. If you need to process huge files, please consider using http://en.wikipedia.org/wiki/NetCDF[netCDF] or http://en.wikipedia.org/wiki/HDF5[HDF5] formatted files and http://nco.sf.net[nco] or http://www.pytables.org/moin[pytables] to process them. 'NetCDF' and 'HDF5' are highly structured binary formats that are both extremely compact and extremely fast to parse/process. BDUC has a number of utilities for processing both types of files, including http://www.r-project.org/[R], http://nco.sf.net[nco], and https://wci.llnl.gov/codes/visit/[VISIT].
==================================================================================

[[sshfs]]
sshfs
~~~~~

http://en.wikipedia.org/wiki/SSHFS[sshfs] is a utility that allows you to mount remote directories in your BDUC home dir. Since it operates in 'user-mode', you don't have to be 'root' or use 'sudo' to use it. It's very easy to use, and you don't have to alert us to use it. You do have to be able to ssh to the machine whose files you want to exchange, typically the desktop or laptop you're connecting to BDUC from (ergo WinPCs cannot do this without much more effort).

For MacOSX and Linux: in the example below, assume I'm connecting from a laptop named 'ringo' to the BDUC 'claw1' node. I have a valid BDUC login ('hmangala') and my login on ringo is 'frodo'. 'sshfs' works on both the 'login' and 'claw1' nodes.

-----------------------------------------------------------------
frodo@ringo:~ $ ssh bduc-claw1   # from ringo, ssh to BDUC with passwordless ssh

# make a dir named 'ringo' for the ringo filesystem mountpoint
hmangala@bduc-claw1:~ $ mkdir ringo

# sshfs-attach the remote filesystem to BDUC on ~/ringo
# NOTE: you usually have to provide the FULL PATH to the remote dir, not '~'
# using '~' on the local side (the last arg) is OK.
# ie: this is wrong:
# hmangala@bduc-claw1:~ $ sshfs frodo@ringo.dept.uci.edu:~ ringo
#                                                         ^
# the following is right:
hmangala@bduc-claw1:~ $ sshfs frodo@ringo.dept.uci.edu:/home/frodo ~/ringo

hmangala@bduc-claw1:~ $ ls -l |head
total 4790888
drwxr-xr-x 2 hmangala hmangala      6 Dec 10 14:17 ringo/   # the new mountpoint for ringo
-rw-r--r-- 1 hmangala hmangala   3388 Sep 22 16:25 9.2.zip
-rw-r--r-- 1 hmangala hmangala   4636 Dec  8 10:18 acct
-rw-r--r-- 1 hmangala hmangala    501 Dec  8 10:20 acct.cpu.user
-rwxr-xr-x 1 hmangala hmangala    892 Nov 11 08:55 alias*
-rw-r--r-- 1 hmangala hmangala    691 Sep 30 13:21 all3.needs
#            ^^^^^^^^^^^^^^^^^ note the ownership

# now I cd into the 'ringo' dir
hmangala@bduc-claw1:~ $ cd ringo
hmangala@bduc-claw1:~/ringo $ ls -lt |head
total 4820212
drwxr-xr-x 1 frodo frodo  20480 2009-12-10 14:43 nacs/
drwxr-xr-x 1 frodo frodo   4096 2009-12-10 14:41 Mail/
-rw------- 1 frodo frodo     61 2009-12-10 12:54 ~Untitled
-rw-r--r-- 1 frodo frodo     42 2009-12-10 12:44 testfromclaw
-rw-r--r-- 1 frodo frodo 627033 2009-12-10 11:22 sun_virtualbox_3.1.pdf
#            ^^^^^^^^^^^ note the ownership. Even tho I'm on bduc-claw1, the original ownership is intact

# writing from BDUC to the ringo filesystem
hmangala@bduc-claw1:~/ringo $ echo "testing testing" > test_from_bduc
hmangala@bduc-claw1:~/ringo $ cat test_from_bduc
testing testing
hmangala@bduc-claw1:~/ringo $ ls -lt |head
total 4820216
drwxr-xr-x 1 frodo frodo 20480 2009-12-10 14:47 nacs/
-rw-r--r-- 1 frodo frodo    16 2009-12-10 14:46 test_from_bduc
drwxr-xr-x 1 frodo frodo  4096 2009-12-10 14:41 Mail/
#            ^^^^^^^^^^^ even tho I wrote it as 'hmangala' on BDUC, it's owned by 'frodo'

# and finally, unmount the sshfs-mounted filesystem.
hmangala@bduc-claw1:~/ringo $ fusermount -u ringo

# get more info on sshfs with 'man sshfs'
-----------------------------------------------------------------

[[yourdata]]
YOU are responsible for your data
---------------------------------

We *do not* have the resources to provide backups of your data. If you store valuable data on BDUC, it is 'ENTIRELY' your responsibility to protect it by backing it up elsewhere. You can do so via the mechanisms discussed above, especially (if using a Mac or Linux) with rsync, which will copy only those bytes which have changed, making it extremely efficient. Using rsync (with examples) http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html#rsync[is described here].

How do I do stuff?
------------------

On the login node, you shouldn't do anything too strenuous (computationally). If you run something that takes more than a minute or so to complete, you should run it on an interactive node or submit it to one of the batch queues. *qrsh* given alone will start an 'ssh -Y' session with one of the nodes in the interactive Q.

Can I compile code?
~~~~~~~~~~~~~~~~~~~

We have the full GNU toolchain available on both the CentOS interactive nodes and on all the Ubuntu/claw nodes, so normal compilation tools such as autoconf, automake, libtool, make, ant, gcc, g++, gfortran, gdb, ddd, java, python, R, perl, etc are available to you. We do not yet have any proprietary compilers or debuggers available (ie. the Intel or PGC compilers or the TotalView Debugger). Please let us know if there are other tools or libraries you need that aren't available.

As of Jan 15, the GNU 4.6.2 series of compilers is available on all nodes. Set up the environment with 'module load gcc/4.6.2'.
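For example (a minimal sketch; the exact version banner may differ from node to node):

-----------------------------------------------------------------
$ module load gcc/4.6.2
$ gcc --version |head -1   # confirm that the module's gcc is now first in your PATH
gcc (GCC) 4.6.2
-----------------------------------------------------------------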
Applications compiled with this series of compilers should be able to run on all nodes (the a64-XXX CentOS nodes, the clawX Ubuntu nodes, and the nXXX Perceus nodes).

Compiling your own code
^^^^^^^^^^^^^^^^^^^^^^^

You can always compile your own (or downloaded) code. Compile it in its own subdir, and when you've built the executables, install them under your own home directory. If the code is well-designed, it should have a 'configure' shell script in the top-level dir. The './configure --help' command should then give you a list of all the parameters it accepts. Typically, all such scripts will accept the '--prefix' flag. You can use this to tell it to install everything in your $HOME dir. ie:

---------------------------------------------------------------------
./configure --prefix=/home/you ...other options..
---------------------------------------------------------------------

This command, when it completes successfully, will generate a 'Makefile'. At this point, you can type 'make' (or 'make -j2' to compile on 2 CPUs) and the code will be compiled into whatever kind of executable is called for. Once the code has been compiled successfully (there may be a 'make test' or 'make check' option to run tests to check for this), you can install it in your $HOME directory tree with 'make install'. ie:

---------------------------------------------------------------------
/home/you/bin
/home/you/man
/home/you/lib
/home/you/include
/home/you/share
---------------------------------------------------------------------

Then you can run it out of your '\~/bin' dir without interfering with other code. In order for you to be able to run it transparently, you will have to prepend your '\~/bin' to the 'PATH' environment variable, typically by editing it into the appropriate line in your '~/.bashrc'.

---------------------------------------------------------------------
export PATH=~/bin:${PATH}
---------------------------------------------------------------------

[[appsavailable]]
How do I find out what's available?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[[modules]]
Via the module command
^^^^^^^^^^^^^^^^^^^^^^

We use the tcl-based http://modules.sourceforge.net/[environment module system] to wrangle non-standard software versions and subsystems into submission.
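Besides 'module avail' (shown below), the handful of module subcommands you'll use most often look like this (a quick sketch; R/2.15.1 is just one of the modules listed below):

-----------------------------------------------------------------
$ module load R/2.15.1     # set up PATH, MANPATH, etc for that app
$ module list              # show which modules you currently have loaded
$ module unload R/2.15.1   # undo the changes
-----------------------------------------------------------------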
To find out what modules are available, simply type:

-----------------------------------------------------------------
$ module avail   # 'modav' gives you a different (scrollable) listing

---------------------------- /apps/Modules/MATH_STATS_ENG -----------------------------
R/2.12.1  R/2.15.1       matlab/R2011b   rapidminer/5.1.001
R/2.13.0  matlab/R2008b  octave/3.0.1    rstudio/0.95.265
R/2.14.0  matlab/R2009b  octave/3.2.0    scilab/5.3.1
R/2.15.0  matlab/R2010b  opensees/2.2.2

------------------------------ /apps/Modules/PROGRAMMING ------------------------------
antlr/3.2               interviews/17         openmpi/1.4.4
ants/1.9                java/1.6              petsc/3.1-p8
ants/1.9.x              matlab/R2008b         pgc/10.6
atlas/3.10.0            matlab/R2009b         python/2.6.1
bioperl/1.6.1           matlab/R2010b         python/2.7.2
boost/1.410             matlab/R2011b         readline/5.2
boost/1.51.0            mpfr/2.4.2            sage/5.1
cern_root/5.32.03       mpich/1.2.7           scilab/5.3.1
enthought_python/6.3-2  mpich2/1.1.1p1        sge/6.2
enthought_python/7.3.1  mpich2/1.2.1p1        sparsehash/1.6
gcc/4.6.2               nagios-client/1.4.16  sqlite/3.6.22
gmp/4.3.2               ncl/5.1.1             sqlite/3.7.9
gnu_parallel/20101202   nco/4.0.4             subversion/1.6.9
gpu/1.0                 netcdf/3.6.3          subversion/1.7.1
hadoop/0.20.2           netcdf/4.1.1          tcl/8.5.5
hdf5/1.8.4p1            open64/4.2.3          tcl/8.5.9
hdf5/1.8.5.p1           openmpi/1.4.2         tk/8.5.5
hdf5/1.8.7              openmpi/1.4.3
hdf5/1.8.9              openmpi/1.4.3_P

------------------------------- /apps/Modules/DATA_VIZ --------------------------------
gnuplot/4.2.4  matlab/R2008b  matlab/R2010b  scilab/5.3.1
gnuplot/4.5p1  matlab/R2009b  matlab/R2011b  visit/1.11.2

---------------------------- /apps/Modules/IMAGE_ANALYSIS -----------------------------
freesurfer/4.5.1Dev  fsl/4.1      loni_pipeline/5.1.4
freesurfer/5.0.0     fsl/4.1.6    mricron/12.2009
freesurfer/5.1.0     imagej/1.41  simset/2.9

-------------------------- /apps/Modules/BIO_GENO_ASSEMBLERS --------------------------
abyss/1.2.5       bowtie/2.0.0b7      ray/1.4        stampy/1.0.20
abyss/1.2.6       bwa/0.5.7           ray/1.6.1      velvet/1.0.19
allpathslg/36681  bwa/0.5.9           ray/1.7        velvet/1.1.02
bfast/0.6.3c      bwa/0.6.1           soap/2.20      velvet/1.2.03
bowtie/0.12.3     edena/2.1.1         ssaha2/2.5.3   velvet/1.2.06
bowtie/0.12.7     gapcloser/20100125  stampy/1.0.12  velvet/1.2.07
bowtie/0.12.8     maq/0.7.1           stampy/1.0.15
bowtie/2.0.0b6    mosaik/1.0.1388     stampy/1.0.17

---------------------------- /apps/Modules/BIO_GENO_UTILS -----------------------------
annovar/2010Jan17  blat/3.4             lastz/1.02         samtools/0.1.18
annovar/2012Feb23  cd-hit/4.5.7         meme/4.7.0         samtools/0.1.7
annovar/2012Mar08  cd-hit/4.6           meme/4.8.1         tabix/0.2.5
annovar/2012May25  exonerate/2.2        msort/20081208     tablet/1.11.01.25
bamtools/2.1.1     gatk/1.0.5336        multiz-tba/012109  tacg/4.5.1
bedtools/2.15.0    gatk/1.6-5-g557da77  nmica/0.8.0        taverna/2.2.0
bedtools/2.6.1     gtf2gff3/0.1         picard/1.45        triton/4.0.0
bioperl/1.6.1      igv/1.5.58           picard/1.67        vcftools/0.1.9
blast/2.2.26+      khmer/8.2011         samtools/0.1.13

----------------------- /apps/Modules/BIO_GENO_VARIANTS_LINKAGE -----------------------
haploview/4.1  hugeseq/1.0.1  pindel/0.2.4q  plink/1.07

------------------ /apps/Modules/BIO_GENO_GENE_STRUCTURE_PREDICTION -------------------
augustus/2.6     cnver/0.7.2     mauve/2.3.1          rmblast/1.2
augustus/2.6.1   cnver/0.8.1     recon/1.70           sva/1.02
breakdancer/1.2  cnvnator/0.2.5  repeatmasker/3.3.0
breakseq/1.3     impute/2.2.2    repeatmodeler/1.0.5
breakway/0.6     mach/1.0.18     repeatscout/1.0.5

------------------------------ /apps/Modules/BIO_EVOMICS ------------------------------
eigensoft/3.0  hyphy/2.0      orthomcl/1.4  phast/1.222.1
hmmer/3.0      mrbayes/3.1.2  paml/4.4

-------------------------- /apps/Modules/BIO_CHIPSEQ_RNASEQ ---------------------------
MACS/1.4.2-1     cufflinks/2.0.0  express/1.1.1  tophat/1.2.0
cufflinks/0.8.1  cufflinks/2.0.2  miso/0.4.1     tophat/1.3.1
cufflinks/1.1.0  express/1.0.0    quest/2.4      tophat/1.4.0
cufflinks/1.3.0  express/1.1.0    tophat/1.0.13  tophat/2.0.4

---------------------------- /apps/Modules/BIO_MOLDYNAMICS ----------------------------
amber/11              gromacs_s_ompi/4.5.4_P  namd/2.8b1
autodock/4.2.3        gromacs_s_ompi/4.5.5    nwchem/6.0
bionetgen/2.2.0       maestro/91207           rosetta/3.1
charmm/c35b6          mgltools/1.5.4          ssc/0.6
gamess/2010R1         modeller/9v7            tinker/5.1.09
gaussian/3.0          namd/2.6                vmd/1.8.7
gromacs_s_ompi/4.5.4  namd/2.7b1              zdock/3.0.1

------------------------------- /apps/Modules/BIO_NEURO -------------------------------
neuron/7.0

(current as of Oct 18th, 2012)
-----------------------------------------------------------------

To load a particular module, use the 'module load' command:

-----------------------------------------------------------------
$ module load imagej/1.41   # for example
-----------------------------------------------------------------

If a module has a dependency, it should set it up for you automatically. Let us know if it doesn't. If you note that a module has an update that we should install, tell us. Also, if you omit the version number, 'module load' will load the numerically highest version, which is not necessarily the latest, since some groups use odd numbering schemes. For example, samtools/0.1.7 is numerically higher (but older) than samtools/0.1.18.

[[honeydo]]
.If you want an app upgraded/updated
[NOTE]
===========================================================================
If you need the newest version of an app, FIRST make sure that we don't already have it installed. See 'module avail' above. THEN please supply us with a link to the updated version so we don't have to scour the internet for it. If it's going to require a long dependency list, please also supply us with an indication of what that is.

If it's an app that few other people will ever use, consider downloading it and installing it in your own ~/bin directory. If after that you think it's worthwhile, we'd certainly consider installing it system-wide.
===========================================================================

Via the shell
^^^^^^^^^^^^^

This is a bit tricky. There are literally thousands of applications available, and many of them have names that are entirely unrelated to their function. To determine whether a well-known application is already on the system, you can simply try typing its name. If it's NOT installed or not on your executable PATH, the shell will return *command not found*.

All the interactive nodes have *TAB completion* enabled, at least in the bash shell. This means that if you type a few characters of the name and hit <Tab> twice, the system will try to complete the command for you. If there are multiple executables that match those characters, the shell will present all the alternatives to you. ie:

-----------------------------------------------------------------
$ jo<Tab><Tab>
jobs        jockey-kde  joe         join
-----------------------------------------------------------------

You can then complete the command, or enter enough characters to make the command unique and hit <Tab> again and the command will complete.

Via the installer Database
^^^^^^^^^^^^^^^^^^^^^^^^^^

The 2 installer databases (one for Ubuntu's *apt-get* on the claw nodes, one for CentOS's *yum* on the rest) will let you search all the applications that HAVE been installed and all those that CAN be installed.
To search for the ones that CAN be installed on the BEAR (claw1-4) nodes, use the command *asrch* (an alias for *apt-cache search*). This searches thru all the application names and descriptions in a case-insensitive search to find a wide variety of names that match the pattern you give it. For example:

-----------------------------------------------------------------
$ asrch biology
avida-base - Auto-adaptive genetic system for Artificial Life research
biomode - [Biology] An Emacs mode to edit genetic data
bioperl - Perl tools for computational molecular biology
<41 lines deleted>
molphy - [Biology] Program Package for MOLecular PHYlogenetics
phylip - [Biology] A package of programs for inferring phylogenies
phylip-doc - [Biology] A package of programs for inferring phylogenies
treetool - [Biology] An interactive tool for displaying trees
tacg - [Biology] a sophisticated 'grep' for nucleic acid strings
-----------------------------------------------------------------

To see a more detailed description of an application, use *ashow* (an alias for *apt-cache show*), which will provide a few lines or paragraphs of text about the application:

-----------------------------------------------------------------
$ ashow phylip
Package: phylip
Priority: optional
Section: multiverse/science
Installed-Size: 5792
Maintainer: Ubuntu MOTU Developers
Original-Maintainer: Debian-Med Packaging Team
Architecture: amd64
Version: 1:3.67-2
Depends: libc6 (>= 2.4), libx11-6, libxaw7, libxt6
Suggests: phylip-doc
Filename: pool/multiverse/p/phylip/phylip_3.67-2_amd64.deb
Size: 2520650
MD5sum: eacef9de8503a21b90a05bfabea9fbca
SHA1: 61a2ec92c1b0699db07ea08196848e2f41f79a6c
SHA256: 3453f9b3bc9d418bf0c4941eb722e807a96ec32ac3a041df34ee569929bd19dc
Description: [Biology] A package of programs for inferring phylogenies
 The PHYLogeny Inference Package is a package of programs for inferring
 phylogenies (evolutionary trees) from sequences. Methods that are available
 in the package include parsimony, distance matrix, and likelihood methods,
 including bootstrapping and consensus trees. Data types that can be handled
 include molecular sequences, gene frequencies, restriction sites, distance
 matrices, and 0/1 discrete characters.
Homepage: http://evolution.genetics.washington.edu/phylip.html
Bugs: mailto:ubuntu-users@lists.ubuntu.com
Origin: Ubuntu
-----------------------------------------------------------------

*HOWEVER*, this only tells you that the application or library is available, not whether it's installed. To find out whether it's installed, use *dpkg*. *dpkg -S pattern* will tell you whether an installed package contains a file matching that pattern, and which package provided it. ie:

-----------------------------------------------------------------
$ dpkg -S ifconfig
net-tools: /sbin/ifconfig
net-tools: /usr/share/man/man8/ifconfig.8.gz
-----------------------------------------------------------------

The '-l' flag has a different meaning (list installed packages), but can also be useful:

-----------------------------------------------------------------
dpkg -l |grep -i python |less
-----------------------------------------------------------------

There is a GUI application called *synaptic* that provides a more pointyclicky interface, but *asrch* and *dpkg* are much faster via the commandline.
To search for all possible applications and libraries on the BDUC CentOS nodes using *yum*, it's similar:

-----------------------------------------------------------------
$ yum search lapack
Loading "downloadonly" plugin
Loading "fastestmirror" plugin
Loading mirror speeds from cached hostfile
 * epel: mirror.hmc.edu
 * dag: apt.sw.be
 * atrpms: dl.atrpms.net
 * rpmforge: ftp-stud.fht-esslingen.de
 * base: centos.cogentcloud.com
 * updates: mirrors.usc.edu
 * lscsoft: www.lsc-group.phys.uwm.edu
 * addons: mirror.stanford.edu
 * extras: centos.promopeddler.com
lapack-devel.i386 : LAPACK development libraries
blas-devel.i386 : LAPACK development libraries
lapack.i386 : The LAPACK libraries for numerical linear algebra
blas-devel.i386 : LAPACK development libraries
<10 lines deleted>
blas.i386 : The BLAS (Basic Linear Algebra Subprograms) library.
lapack-devel.i386 : LAPACK development libraries
blas.i386 : The BLAS (Basic Linear Algebra Subprograms) library.
R-RScaLAPACK.i386 : An interface to perform parallel computation on linear
                    algebra problems using ScaLAPACK
-----------------------------------------------------------------

To find out if a package has been installed:

-----------------------------------------------------------------
$ yum list lapack-devel.i386
Loading "downloadonly" plugin
Loading "fastestmirror" plugin
Loading mirror speeds from cached hostfile
 * epel: mirror.hmc.edu
 * dag: apt.sw.be
 * atrpms: dl.atrpms.net
 * rpmforge: ftp-stud.fht-esslingen.de
 * base: mirrors.xmission.com
 * updates: mirrors.usc.edu
 * lscsoft: www.lsc-group.phys.uwm.edu
 * addons: centos.cogentcloud.com
 * extras: mirror.hmc.edu
Installed Packages
lapack-devel.i386    3.1.1-1.el5.rf    installed
-----------------------------------------------------------------

Via the Internet
^^^^^^^^^^^^^^^^

Obviously, a much wider ocean to search. My first approach is to use a Google search constructed of the platform, application name, and/or function of the software. Something like:

-----------------------------------------------------------------
linux image photography hdr 'high dynamic range'   # the '' enforces the exact phrase
-----------------------------------------------------------------

which yields http://tinyurl.com/nf5qrn[this page of results.] Also, don't be afraid to try http://www.google.com/advanced_search?hl=en[Google's Advanced Search] or even http://www.google.com/linux[Google's Linux Search].

After evaluating the results, you'll come to a package that seems to be what you're after, pfstools, for example. If you didn't find this in the previous searches of the application databases, you can look again, searching explicitly:

-----------------------------------------------------------------
$ ashow pfstools
...
Description: command line HDR manipulation programs
 pfstools is a set of command line (and two GUI) programs for reading,
 writing, manipulating and viewing high-dynamic range (HDR) images and
 video frames. All programs in this package exchange data using a simple
 generic file format (pfs) for HDR data. It is an attempt to integrate
 existing file formats by providing a simple data format that can be used
 to exchange data between applications.
...
-----------------------------------------------------------------

and then you can ask an admin to install it for you. Typically, the apps found in the application repositories lag the latest releases by a few point versions, so if you really need the latest version, you'll have to download the source code or binary package and install it from that package.
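If you do go the compile-it-yourself route, the usual pattern is the one described above under 'Compiling your own code' (a hypothetical sketch - the URL, package name, and version are made up):

-----------------------------------------------------------------
$ wget http://example.org/someapp-1.8.tar.gz   # hypothetical URL and version
$ tar xzf someapp-1.8.tar.gz && cd someapp-1.8
$ ./configure --prefix=$HOME
$ make && make install                         # installs into ~/bin, ~/lib, etc.
-----------------------------------------------------------------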
You can compile your own version as a private package, but to install it as a system binary, you'll have to ask one of the admins.

Interactive Use
~~~~~~~~~~~~~~~

Logging on to an interactive node may be all that you need. If you want to slice & dice data interactively, either with a graphical app like http://www.mathworks.com/products/matlab/description1.html[MATLAB], https://wci.llnl.gov/codes/visit/[VISIT], http://jmp.com/[JMP], or http://www.clustal.org/[clustalx], or a commandline app like http://nco.sf.net[nco] or http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[scut], or even hybrids like http://gnuplot.info/[gnuplot] or http://www.r-project.org/[R], you can run them from any of the interactive nodes and read, analyze, and save data to your '/home' directory. As long as you satisfy the link:#graphics[graphics] requirements, you can view the output of the X11 graphics programs as well.

bash Shortcuts
~~~~~~~~~~~~~~

The bash shell allows an infinite amount of customization and shortcuts via scripts and the 'alias' command. Should you wish to make use of such things (such as 'nu' to show you the newest files in a directory or 'll' to show you the long ls output in human readable form), you can define them yourself by typing them at the commandline:

-----------------------------------------------------------------
alias nu="ls -lt |head -22"   # gives you the 22 newest files in the dir
alias ll="ls -l"              # long 'ls' output
alias llh="ls -lh"            # long 'ls' output in human (KB, MB, GB, etc) form
alias lll="ls -lh |less"      # pipe the preceding one into the 'less' pager

# for aliases, there can be no spaces around the '=' between the alias
# name and the start of the definition:
#   [myalias = "what it means"]   is wrong
#   [myalias="what it means"]     is right
-----------------------------------------------------------------

You can also place all your useful aliases into your '\~/.bashrc' file so that all of them are defined when you log in. Or separate them from the '\~/.bashrc' by placing them into a '\~/.alias' file and have it sourced from your '~/.bashrc' file when you log in. That separation makes it easier to move your 'alias library' from machine to machine. A collection of such aliases is stashed on the login node at '/usr/local/share/alias.login':

-----------------------------------------------------------------
# copy it to your local dir:
cp /usr/local/share/alias.login ~/.alias
# and modify it as needed, then 'activate' it by
source ~/.alias
# or
. ~/.alias
# or (once you've sourced it once)
sa   # aliased now to 'source ~/.alias'
-----------------------------------------------------------------

[[byobu]]
byobu and screen: keeping a session alive between logins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In most cases, when you log out of an interactive session, the processes associated with that login will also be killed off, even if you've put them in the background (by appending '&' to the starting command). If you regularly need a process to continue after you've logged out, you should submit it to the SGE scheduler with 'qsub' (link:#SGE_batch_jobs[see immediately below]). However, sometimes it is convenient to continue a long-running process when you have to log out (as when you have to shut down your network connection to take your laptop home). In this case, you can use the underappreciated 'screen' program, which establishes a long-running proxy connection on the remote machine that you can detach from and then re-attach to without losing the connection.
As far as the remote machine is concerned, you've never logged off, so your running processes aren't killed off. When you re-establish the connection by logging in again, you can re-attach to the screen proxy and take up as if you'd never been away. You can also use 'screen' as a terminal multiplexer, allowing multiple terminal sessions to be used from one login, especially useful if you're using Windows with PuTTY, which doesn't have a multiple-terminal function built into it.

For these reasons, 'screen' by itself is a very powerful and useful utility, but it is admittedly hard to use, even with http://www.catonmat.net/download/screen.cheat.sheet.pdf[a good cheatsheet]. To the rescue comes a 'screen' wrapper called 'byobu' which provides a much easier-to-use interface to the 'screen' utility. 'byobu' has been installed on all the interactive nodes on BDUC and can be started by typing:

-----------------------------------------------------------------
$ byobu
-----------------------------------------------------------------

There will be a momentary screen flash as it refreshes and re-displays the login, and then the screen will look similar, except for 2 lines along the bottom that show the screen status. In the images below, the one at left is 'without byobu'; at right is 'with byobu'. The 'byobu' screen shows 3 active sessions: 'login', 'claw_1', and 'bowtie'. The graphical tabs at the bottom are part of the KDE application http://konsole.kde.org/[konsole], which also supports multiplexed sessions (allowing you to multi-multiplex sessions (polyplex?)).

image:without_byobu_s.jpg[without byobu]
image:with_byobu_s.jpg[with byobu]

The help screen, shown below, can always be gotten to via the 'F9' configuration menu:

-----------------------------------------------------------------
Byobu 2.57 is an enhancement to GNU Screen, a command line tool providing
live system status, dynamic window management, and some convenient keybindings:

  F2 Create a new window    |  F6 Detach from the session
  F3 Go to the prev window  |  F7 Enter scrollback mode
  F4 Go to the next window  |  F8 Re-title a window
  F5 Reload profile         |  F9 Configuration | F12 Lock this terminal

  'screen -r' - reattach       |  Escape sequence
  'man screen' - screen's help |  'man byobu' - byobu's help
-----------------------------------------------------------------

Most usefully, you can create new sessions with the 'F2' key, switch between them with 'F3/F4', and detach from the screen session with 'F6'. Note that you must have started a 'screen' session before you can detach, so to make sure you're always in a screen session, you can have it start automatically on login by changing the state of the *Byobu currently launches at login* flag (at the bottom of the screen after the 1st 'F9'). When you log back in after having detached, type 'byobu' again to re-attach to all your running processes. If you set 'byobu' to start automatically on login, there will be no need of this, of course, as it will have started.

Note that 'byobu' is just a wrapper for 'screen' and the native 'screen' commands continue to work. As you become more familiar with 'byobu', you'll probably find yourself using more of the native 'screen' commands. See this very good http://www.catonmat.net/download/screen.cheat.sheet.pdf[screen cheatsheet].

[[EnvVars]]
Environment Variables
---------------------

Environment variables ('envvars') are those which are set for your session and can be modified for your use.
They include directives to the shell as to which browser or editor you want started when needed, or application-specific paths that describe where some data, executables, or libraries are located. For example, here is my envvar list, generated by 'printenv':

-----------------------------------------------------------------
hmangala@claw2:~
506 $ printenv
MANPATH=/sge62/man:/usr/share/man
XKEYSYMDB=/usr/share/X11/XKeysymDB
HOST=claw2.bduc
KDEWM=kwin
TERM=xterm
SHELL=/bin/bash
XDG_SESSION_COOKIE=e0592749581405c0291e7dff4d6bf8ae-1331569334.933079-1976089657
GTK2_RC_FILES=/home/hmangala/.gtkrc-2.0
CPPFLAGS=-DNC_64BIT_OFFSET=0
SSH_CLIENT=10.255.78.2 35986 22
PERCEUS_HOME=/usr/var/lib/perceus/
SGE_CELL=bduc_nacs
SSH_TTY=/dev/pts/0
sge=/home/hmangala/bduc/trunk/sge
ARC=lx24-amd64
USER=hmangala
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
SGE_JOB_DIR=/sge62/bduc_nacs/spool/qmaster/jobs
XEDITOR=nedit
MAIL=/var/mail/hmangala
PATH=/home/hmangala/bin:/home/local/bin:/usr/local/sbin:/usr/local/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin/u/local/bin:/sge62/bin/lx24-amd64:/sge62/util:/sge62/utilbin/lx24-amd64:/sge62/mpich2/bin:/usr/NX/bin:/usr/X11R6/bin:/sge62/mpich2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/sge62/bin/lx24-amd64:/home/hmangala/bin
PWD=/home/hmangala
_LMFILES_=/usr/share/Modules/modulefiles/sge/6.2
EDITOR=joe
SGE_EXECD_PORT=537
LANG=en_US.UTF-8
LAMRSH=ssh
SGE_QMASTER_PORT=536
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
SGE_ROOT=/sge62
LOADEDMODULES=sge/6.2
SVN_SSH=ssh
TACGLIB=/usr/local/lib/tacg
CXX=g++
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
SHLVL=1
HOME=/home/hmangala
LOGNAME=hmangala
CVS_RSH=ssh
SSH_CONNECTION=10.255.78.2 35986 10.255.78.6 22
MODULESHOME=/usr/share/Modules
BROWSER=/usr/bin/konqueror
CC=gcc
module=() {  eval `/usr/bin/modulecmd bash $*`
}
_=/usr/bin/printenv
-----------------------------------------------------------------

Many of these are generated by the bash shell or by system login processes. Some that I set myself are:

-----------------------------------------------------------------
EDITOR=joe                    # the text editor to be invoked from 'less' by typing 'v'
TACGLIB=/usr/local/lib/tacg   # a data dir for a particular application
XEDITOR=nedit                 # my default GUI/X11 editor
BROWSER=/usr/bin/konqueror    # my default web browser
-----------------------------------------------------------------

Many applications require a set of 'envvars' to define paths to particular libraries or to data sets. In 'bash', you define an 'envvar' very simply by setting it with an '=':

-----------------------------------------------------------------
# for example, PATH is the directory tree thru which the shell will search for executables
PATH=/usr/bin

# you can append to it (search the new dir after the defined PATH):
PATH=$PATH:/usr/local/bin

# or prepend to it (search the new dir before the defined PATH)
PATH=/usr/local/bin:$PATH
-----------------------------------------------------------------

Note that when you 'assign to' these 'envvars', you use the 'non-$name' version, and when you 'use' them in bash scripts, you use the '$name' version.
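For example (a minimal sketch; 'MYDATA' is a made-up envvar):

-----------------------------------------------------------------
MYDATA=/home/hmangala/data          # assigning: bare name, no '$'
echo $MYDATA                        # using: prefix the name with '$'
cp $MYDATA/in.txt $MYDATA/out.txt   # eg, in a script
-----------------------------------------------------------------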
Further, in some cases when you use the '$name' version, if it's not clear by context what is a variable and what is not, using braces {} to isolate the name can help ('${name}'), as well as allowing you to do additional magic with 'parameter expansion' (using the braced variable to get values from the shell or to perform additional work on the variable). Double parentheses (()) are used to indicate that arithmetic is being performed on the variables. Note that inside the parens, you don't have to use the '$name':

-----------------------------------------------------------------
# using $a, $b, & $c in an arithmetic expression:
$ a=56; b=35; c=1221
$ echo $((a + b * 4/c))
56
# note this will be integer math, so '56' is returned, not '56.1146601147'
-----------------------------------------------------------------

See http://goo.gl/JvxnT[this bit on stackoverflow] for a longer, but still brief explanation.

[[SGE]]
[[SGE_batch_jobs]]
SGE Batch Submission & Queues
-----------------------------

If you have jobs that are very long or require multiple nodes to run, you'll have to 'submit' jobs to an SGE Queue (aka Q). *qsub job_name.sh* will submit the job described by 'job_name.sh' to SGE, which will look for an appropriate Q and then start the job running via that Q. For example, if you need a long-running Q, you can request it explicitly: 'qsub -q long job_name.sh', which will try to run it on the least loaded machine.

Once you log into the login node (via 'ssh -Y your_UCINetID@bduc-login.nacs.uci.edu'), you can get an idea of the hosts that are currently up by issuing the *qhost* command. You can find out the status of your jobs with *qstat* alone, which will tell you the status of *your* jobs, or

-------------------------------------------------------
qstat -u '*'
-------------------------------------------------------

will tell you the status of all jobs currently queued or running. A very useful PDF cheatsheet for the SGE 'q' commands http://gridengine.info/files/SGE_Cheat_Sheet.pdf[is here].

To get an overall idea of the status of the entire cluster, type 'bduc_status', which will dump a listing of:

- who's logged into the node
- the top 100 jobs currently running
- nodes/Qs in error state
- overall cluster node usage by Q.

SGE qstat status codes
~~~~~~~~~~~~~~~~~~~~~~

SGE status codes:

[options="header"]
|========================================================================================
|Category  | State                                          | SGE Letter Code
|Pending   | pending                                        | qw
|          | pending, user hold                             | qw
|          | pending, system hold                           | hqw
|          | pending, user and system hold                  | hqw
|          | pending, user hold, re-queue                   | hRwq
|          | pending, system hold, re-queue                 | hRwq
|          | pending, user and system hold, re-queue        | hRwq
|Running   | running                                        | r
|          | transferring                                   | t
|          | running, re-submit                             | Rr
|          | transferring, re-submit                        | Rt
|Suspended | job suspended                                  | s, ts
|          | queue suspended                                | S, tS
|          | queue suspended by alarm                       | T, tT
|          | all suspended with re-submit                   | Rs, Rts, RS, RtS, RT, RtT
|Error     | all pending states with error                  | Eqw, Ehqw, EhRqw
|Deleted   | all running and suspended states with deletion | dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT
|========================================================================================

http://impact.open.ac.uk/?q=faq/7[Original table here].

qsub scripts
~~~~~~~~~~~~

The shell script that you submit ('job_name.sh' above) should be written in 'bash' and should completely describe the job, including where the inputs and outputs are to be written (if not specified, the default is your home directory).
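Before looking at a sample script, here is the basic submit-and-watch cycle as a sketch, using the hypothetical 'job_name.sh' from above:

-------------------------------------------------------
$ qsub job_name.sh     # submit; SGE replies with the assigned job-ID
$ qstat                # watch the state column: 'qw' (waiting) -> 'r' (running)
$ qstat -j <job_id>    # full detail on one job, including why it isn't running yet
$ qdel <job_id>        # remove the job if you change your mind
-------------------------------------------------------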
The following is a simple shell script that defines 'bash' as the job environment, calls 'date', waits 20s, and then calls it again.

-------------------------------------------------------
#!/bin/bash
# (c) 2008 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.
# This is a simple example of a SGE batch script

# request Bourne shell as shell for job
#$ -S /bin/bash

# print date and time
date
# Sleep for 20 seconds
sleep 20
# print date and time again
date
-------------------------------------------------------

Note that your script has to include (usually at the end) at least one line that executes something - generally a compiled program, but it could also be a Perl or Python script (which could also invoke a number of other programs). Otherwise your SGE job won't do anything.

[[keepdatalocal]]
Using qsub scripts to keep data local
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BDUC depends on a network-shared '/home' filesystem. The actual disks are in the bduc-login node, so users are local to the data when they log in. However, when you submit an SGE job, unless otherwise specified, the nodes have to read the data over the network and write it back across the network. This is fine when the total data involved is a few MB, such as is often the case with molecular dynamics runs - small data in, lots of computation, small data out. However, if your jobs involve 100s or 1000s of MB, the network traffic can grind the entire cluster to a halt.

To prevent this network armageddon, there is a '/scratch' directory on each node which is writable by all users, but is 'sticky' - the files written can only be deleted by the user who wrote them.

-------------------------------------------------------
$ ls -ld /scratch
drwxrwxrwt 6 root root 4096 Oct 29 18:20 /scratch/
         ^
         + the 't' indicates 'stickiness'
-------------------------------------------------------

If there is a chance that your job will consume or emit lots of data, please use the local /scratch dir to stage your input data, and especially to write your output. This is dirt simple to do. Since your qsub script executes on each node, your script should copy the data from your '$HOME' dir to '/scratch/$USER/input' to stage the data, then specify '/scratch/$USER/input' as input, with your application writing to '/scratch/$USER/output_node#'. When the application has finished, copy the output files back to your '$HOME' dir again, and finally clean up '/scratch/$USER/whatever' afterwards. Here's https://wiki.duke.edu/display/SCSC/Scratch+Disk+Space[another page of information] on using scratch space.

More example qsub scripts
^^^^^^^^^^^^^^^^^^^^^^^^^

- an http://moo.nac.uci.edu/~hjm/bduc/scratchjob.sh[example script] that does data copying to /scratch
- another http://moo.nac.uci.edu/~hjm/bduc/scratch_example_2.sh[example script that uses /scratch] and collates and moves data back to $HOME after it's done.
- http://moo.nac.uci.edu/~hjm/bduc/sleeper1.sh[sleeper1.sh] is a slightly more elaborate one.
- http://moo.nac.uci.edu/~hjm/bduc/fsl_sub[fsl_sub] is a longer, much more elaborate one that uses a variety of parameters and tests to set up the run.
- http://moo.nac.uci.edu/~hjm/bduc/array_job.sh[array_job.sh] is a qsub script that implements an array job - it uses SGE's internal counter to vary the parameters to a command. This example also uses some primitive bash arithmetic to calculate the parameters.
- http://moo.nac.uci.edu/~hjm/bduc/qsub_generate.py[qsub_generate.py] is a Python script for generating serial qsubs, in a manner similar to the SGE array jobs. However, if you need more control over your inputs & outputs and/or are more familiar with Python, it may be useful.
- a script that launches http://moo.nac.uci.edu/~hjm/bduc/MPI_suspendable.sh[an MPI script] in a way that allows it to *suspend and restart*. If you do not write your MPI scripts in this way and try to suspend them, they will be aborted and you'll lose your intermediate data. (NB: it can take minutes for an MPI job to smoothly suspend; only seconds to restart.)
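If you just want the shape of the copy-in/compute/copy-out pattern described above without downloading those examples, here is a minimal sketch; the program and file names are hypothetical:

-------------------------------------------------------
#!/bin/bash
#$ -S /bin/bash
#$ -q long

# stage the input from the shared /home to the node-local /scratch
mkdir -p /scratch/$USER/input /scratch/$USER/output_$JOB_ID
cp $HOME/project/bigdata.in /scratch/$USER/input/

# run the (hypothetical) program entirely against the local disk
myprog -i /scratch/$USER/input/bigdata.in -o /scratch/$USER/output_$JOB_ID/bigdata.out

# copy the results home, then clean up after yourself
cp /scratch/$USER/output_$JOB_ID/* $HOME/project/results/
rm -rf /scratch/$USER/input /scratch/$USER/output_$JOB_ID
-------------------------------------------------------

(For mass copies, replace the 'cp' stage-in with the bandwidth-limited 'scp' described in the caveats below.)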
[[stagingdata]]
.Staging data - some important caveats
[IMPORTANT]
==================================================================================
*READING:* Copying data to the remote node makes sense when you have large input data that has to be repeatedly parsed. It makes less sense when a lot of data has to be read *once* and is then ignored. (If the data is only read once, why copy it? Just read it in the script.) If you stage it to '/scratch', it is still traversing the network once, so there is little advantage. (If you have significant data to be re-read on an ongoing basis, contact me and, depending on circumstances, we may be able to let you leave it on the '/scratch' system of a set of nodes for an extended period of time. Otherwise, we expect that all '/scratch' data will be cleaned up post-job.)

*WRITING:* Due to the link:#writeperfongl[limitations of the gluster filesystem], where zillions of small writes can result in very bad write performance, it may make sense to stage your *output* to the node-local '/scratch' dir and then copy the output files to their final resting places after the analysis. This will bypass gluster's bad performance on repeated small writes and instead do a single streaming write after everything has been written (at which gluster is very good).

If it does make sense to stage your data, please try to follow the guidelines below. If the cluster locks up, offending jobs will be deleted without warning, so ask me if you have questions.

*Limit your staging bandwidth* +
If your job(s) are going to require a mass copy (for example, if you submit 20 jobs that each have to copy 1GB), then throttle your jobs appropriately by using a bandwidth-limiting protocol like 'scp -C -l 2000' instead of 'cp'. This 'scp' command compresses the data and also limits the bandwidth to ~250KB/s in the above case ('2000' refers to KiloBITS, not KiloBYTES). 'scp' will work without requiring passwords, just like 'ssh' within the cluster. The syntax is slightly different tho.

-------------------------------------------------------------------------------
# use scp to copy from bduc-login to a local node dir as would be required in a qsub script
scp -C -l 2000 bduc-login:~/my_file /scratch/hmangala
-------------------------------------------------------------------------------

This prevents a few bandwidth-unlimited jobs from causing the available cluster bandwidth to drop to zero, locking up all users. If you have 'a single job' that will copy a single 100MB file, then don't worry about it; just copy it directly. Assume the aggregate bandwidth of the cluster is about '50 MB/s'. No set of jobs should exceed half of that, so if you're submitting 50 jobs, the total bandwidth should be set to no more than 25MB/s, or 0.5 MB/s per job - in scp terms, '-l 4000'.
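To make that arithmetic explicit ('scp -l' takes kilobits/s, so the per-job budget has to be converted from MB/s), here is a sketch using the numbers above:

-------------------------------------------------------------------------------
# 50 jobs sharing a 25 MB/s budget -> 0.5 MB/s per job
# 0.5 MB/s * 8 bits/byte = 4 Mbit/s = 4000 Kbit/s
scp -C -l 4000 bduc-login:~/my_file /scratch/$USER
-------------------------------------------------------------------------------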
*Check the network before you submit a job* +
While there's no way to predict the cluster environment after you submit a job, there's no reason to make an existing BAD situation worse. If the cluster is exhibiting network congestion, don't add to it by submitting 100 staging jobs. (And if it does appear to be lagging, mailto:harry.mangalam@uci.edu[please let me know].)

[[congestion]]
*How to check for cluster congestion* +
On the login node, you can use a number of tools to see what the status is.

- 'bduc_status' will dump a long description detailing who's logged in, the SGE Q status (including the 1st 100 jobs), any Qs in error state, the Queue Summary, the hosts that are down, and the overall cluster load.
- 'top' gives you an updating summary of the top CPU-using processes on the node. If the top processes include 'nfsd', and the load average is above \~4 with no user processes exceeding 100%, then the cluster can be considered congested. Most users have a multi-colored prompt that shows the current 1m, 5m, & 15m load on the system in square brackets.

-------------------------------------------------------------------------------
Fri Sep 23 14:56:15 [0.13 0.20 0.36] hjm@bongo:~
617 $
-------------------------------------------------------------------------------

(For those that don't have the fancy prompt, you can add it by inserting the following line into your '\~/.profile' or '~/.bashrc'.)

-------------------------------------------------------------------------------
PS1="\n\[\033[01;34m\]\d \t \[\033[00;33m\][\$(cat /proc/loadavg | cut -f1,2,3 -d' ')] \
\[\033[01;32m\]\u@\[\033[01;31m\]\h:\[\033[01;33m\]\w\n\! \$ \[\033[00m\]"
-------------------------------------------------------------------------------

- 'nfswatch' produces a 'top'-like output that can display a number of usage patterns on NFS, including top client by hostname, username, etc.
- 'nethogs' produces a 'top'-like output that shows which processes are using the most bandwidth.
- 'ifstat' will produce a continuous, instantaneous chart of network interface traffic.
- 'dstat' will produce a similar readout of many system parameters, including CPU, memory usage, network, and storage activity.
- 'htop' produces a colored, top-like output that is multiply sortable, to help debug what's happening with the system.
- 'atop' produces yet another top-like output, but highlights saturated systems. It provides more info to the root user, but is also useful for regular users.
- 'iftop' produces a very useful (but only available to root) text-based, updating diagram of network bandwidth by endpoints. Mentioned as it might be useful to users on their own machines.
- 'etherape' will produce a graphical ring picture of your network, with connections colored by connection type and sized by the amount of data flowing thru them.
==================================================================================

Debugging why your job isn't running
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can (at least partially) diagnose your own SGE problems.
It may well be that the Qs are set up sub-optimally (and if so, we'll try to work with you to optimize them), but you can see very quickly if that's the case or if it's due to a more mundane problem:

--------------------------------------------------------
$ qstat             # will give you a list of your SGE jobs
$ qstat -j <job_id> # will give you an exhaustive list of reasons
                    # that your job is not executing
--------------------------------------------------------

Current Queue Organization
~~~~~~~~~~~~~~~~~~~~~~~~~~

The batch queues have been reorganized for clarity. They are now organized as follows:

-------------------------------------------------------
Queue       time*   total CPUs   Type
=====================================================
int         2hr     4            interactive (*)
long-ics    240hr   78           batch
long-adc    240hr   64           batch
long-quad   240hr   124          batch (all 4core motherboards)
long        240hr   191          batch

* for the 'int' Q, you have 2 hr of aggregate CPU time
  (not wallclock time).
-------------------------------------------------------

To submit short jobs (<12hr), you can most easily *not* specify a Q - it will go on any batch Q. To run on one of the long Qs, either specify the estimated runtime in the submission script by including the *-l h_rt* parameter

-------------------------------------------------------
#$ -l h_rt=00:30:00  # 30 min run
-------------------------------------------------------

(also see below) or submit specifically to one of the long Qs, ie:

-------------------------------------------------------
$ qsub -q long-ics yourshellname.sh
# or include the Q spec in the script:
#$ -q long-ics
-------------------------------------------------------

Fixing qsub errors
~~~~~~~~~~~~~~~~~~

Occasionally, a script will hiccup and put your job into an error state. This can be seen in the qstat *state* output:

-------------------------------------------------------
$ qstat -u '*'
job-ID  prior    name      user     state submit/start at     queue            slots ja-task-ID
-----------------------------------------------------------------------------------------------
6868    0.62500  simple.sh hmangala E     06/08/2009 11:29:02 claws@claw3.bduc 1
                                    ^^^
-------------------------------------------------------

The *E* (^^^) means that the job is in an *ERROR* state. You can either delete the job with *qdel*:

-------------------------------------------------------
qdel <job_id>  # deletes the job
-------------------------------------------------------

or often change its status with the *qmod* command.
-------------------------------------------------------
qmod -cj <job_id>  # clears the error state of the job
-------------------------------------------------------

[[SGE_script_params]]
Some useful SGE script parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you submit an SGE script, it is processed by 'both bash and SGE'. In order to protect the SGE directives from being misinterpreted by 'bash', they are prefixed by '#$'. This prefix causes bash to ignore the rest of the line (it considers it a comment), but allows SGE to process the directive correctly. So, the rules are:

- If it's a bash command, don't prefix it at all.
- If it's an SGE directive, prefix it with both characters ('#$').
- If it's a comment, prefix it only with a '#'.

//#$ -q long*@a64-*  # run only on these nodes in this Q

Here are some of the most frequently used:

-------------------------------------------------------
#$ -N job_name        # this name shows in qstat
#$ -S /bin/bash       # run with this shell
#$ -q long-ics        # run in this Q
#$ -l h_rt=50:00:00   # need 50 hour runtime
#$ -l mem_free=2G     # need 2GB free RAM
#$ -pe mpich 4        # define parallel env
#$ -cwd               # run the job out of the current directory
                      # (the one from which you ran the script)
#$ -o job_name.out    # the name of the output file
#$ -e job_name.err    # the name of the error file
# or
#$ -o job_name.out -j y  # '-j y' merges stdout and stderr
#$ -t 0-10:2          # task index range (for looping); generates 0 2 4..10
                      # Uses $SGE_TASK_ID to find out whether it is task 0, 2, 4, 6, 8 or 10
#$ -notify            # have SGE warn the job (via signals) before suspending or killing it
#$ -M <email_address> # send mail about this job to the given email address
#$ -m beas            # send a mail to the owner when the job
                      # begins (b), ends (e), is aborted (a),
                      # or is suspended (s).
-------------------------------------------------------

// #$ -l scr_free=XG  # need X GB scratch space
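To make the '-t' (array job) directive listed above concrete, here is a minimal sketch of an array job; 'myprog' and the chunk-file naming are hypothetical:

-------------------------------------------------------
#!/bin/bash
#$ -S /bin/bash
#$ -N array_demo
#$ -cwd
#$ -t 0-10:2    # run 6 tasks, with $SGE_TASK_ID = 0, 2, 4, 6, 8, 10

# each task independently processes the input chunk matching its own task ID
myprog -i chunk_${SGE_TASK_ID}.in -o chunk_${SGE_TASK_ID}.out
-------------------------------------------------------

Because each task is scheduled independently, a large input split into chunks can spread across the cluster from a single qsub.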
When a job starts, a number of SGE environment variables are set and are available to the job script. Here are most of them:

- ARC - The Sun Grid Engine architecture name of the node on which the job is running; the name is compiled into the sge_execd binary
- SGE_ROOT - The Sun Grid Engine root directory, as set for sge_execd before start-up, or the default /usr/SGE
- SGE_CELL - The Sun Grid Engine cell in which the job executes
- SGE_JOB_SPOOL_DIR - The directory used by sge_shepherd(8) to store job-related data during job execution
- SGE_O_HOME - The home directory path of the job owner on the host from which the job was submitted
- SGE_O_HOST - The host from which the job was submitted
- SGE_O_LOGNAME - The login name of the job owner on the host from which the job was submitted
- SGE_O_MAIL - The content of the MAIL environment variable in the context of the job submission command
- SGE_O_PATH - The content of the PATH environment variable in the context of the job submission command
- SGE_O_SHELL - The content of the SHELL environment variable in the context of the job submission command
- SGE_O_TZ - The content of the TZ environment variable in the context of the job submission command
- SGE_O_WORKDIR - The working directory of the job submission command
- SGE_CKPT_ENV - Specifies the checkpointing environment (as selected with the qsub -ckpt option) under which a checkpointing job executes
- SGE_CKPT_DIR - Only set for checkpointing jobs; contains the path ckpt_dir (see the checkpoint manual page) of the checkpoint interface
- SGE_STDERR_PATH - The path name of the file to which the standard error stream of the job is diverted; commonly used for enhancing the output with error messages from prolog, epilog, parallel environment start/stop, or checkpointing scripts
- SGE_STDOUT_PATH - The path name of the file to which the standard output stream of the job is diverted; commonly used for enhancing the output with messages from prolog, epilog, parallel environment start/stop, or checkpointing scripts
- SGE_TASK_ID - The task identifier in the array job represented by this task
- ENVIRONMENT - Always set to BATCH; this variable indicates that the script is run in batch mode
- HOME - The user's home directory path from the passwd file
- HOSTNAME - The host name of the node on which the job is running
- JOB_ID - A unique identifier assigned by the sge_qmaster when the job was submitted; the job ID is a decimal integer in the range 1 to 99999
- JOB_NAME - The job name, built from the qsub script filename, a period, and the digits of the job ID; this default may be overwritten by qsub -N
- LOGNAME - The user's login name from the passwd file
- NHOSTS - The number of hosts in use by a parallel job
- NQUEUES - The number of queues allocated for the job (always 1 for serial jobs)
- NSLOTS - The number of queue slots in use by a parallel job

The above was extracted from http://www.cbi.utsa.edu/sge_tutorial[this useful page]. For more on SGE shell scripts, http://nbcr.sdsc.edu/pub/wiki/index.php?title=Sample_SGE_Script[see here]. For a sample SGE script that uses mpich2, link:#mpich2script[see below].

Where do I get more info on SGE?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Oracle's purchase of Sun has resulted in a major disorganization of the SGE (now OGE) documentation. If a link doesn't work, it may be because of this kerfuffle; please tell me and I'll try to fix it.

* The ROCKS group has a http://www.rocksclusters.org/rocksapalooza/2006/lab-sge.pdf[very good SGE Introduction] from the User's perspective. Ignore the ROCKS-specific bits.
* http://www.google.com/search?hl=en&q=Sun+Grid+Engine&btnG=Search[Google Sun Grid Engine] is a good, easy start. Maybe you'll be lucky.. :)
* http://gridengine.info/[Chris Dagdigian's SGE site] is very good and has an http://wiki.gridengine.info/wiki/index.php?Main_Page[excellent wiki].
* The official http://www.oracle.com/technetwork/oem/grid-engine-166852.html[Sun (now Oracle) Grid Engine site] has a lot of good links.
* The http://wikis.sun.com/display/sungridengine/Home[SGE docs] are the final word, but there are a lot of pages to cover.

If you need to run an MPI parallel job, you can request the needed resources by Q as well, by specifying the resources inside the shell script (more on this later) or externally via the -q and -pe flags (type 'man sge_pe' on one of the BDUC nodes).

Special cases
-------------

Editing Huge Files
~~~~~~~~~~~~~~~~~~

In a word, *don't*. Many research domains generate or use multi-GB text files. Prime offenders are log files and High-Thruput Sequencing files such as those from Illumina. These are meant to be processed programmatically, not with an interactive editor. When you use such an editor, it typically tries to load the entire thing into memory and generates various cache files. (If you know of a text editor that handles such files without doing this, please let me know.) Otherwise, use the utilities http://goo.gl/6kBwR[head], http://goo.gl/ISdl2[tail], http://goo.gl/3vB04[grep], http://goo.gl/PQY80[split], http://goo.gl/nDbu[less], http://goo.gl/nZwOX[sed], and http://goo.gl/r8YOc[tr], possibly in combination with http://goo.gl/TkFSc[Perl]/http://goo.gl/Vjqc[Python], to peek into such files and/or change them. http://en.wikipedia.org/wiki/Grep[grep] especially is one of the most useful tools for text processing you'll ever use.

For example, the following command starts at 2,000,000 lines into a file, stops at 2,500,000 lines, and shows that range in the 'less' pager.

---------------------------------------------------------------------
$ perl -n -e 'print if ( 2000000 .. 2500000)' humongo.txt | less
---------------------------------------------------------------------

In addition, please use the commandline utilities http://goo.gl/WQGhy[gzip/gunzip], http://goo.gl/baoIB[bzip2], http://goo.gl/VpiyQ[zip], http://goo.gl/7sdXN[zcat], etc instead of the http://goo.gl/b2828[ark] graphical utility on such files. 'ark' apparently tries to store everything in RAM before dumping it.

NAMD scripts
~~~~~~~~~~~~

http://www.ks.uiuc.edu/Research/namd/[namd] is a molecular dynamics application that interfaces well with http://www.ks.uiuc.edu/Research/vmd/[VMD]. Both of these are available on BDUC - see the output of the 'module avail' command. The 'qsub' scripts to submit 'namd 2.7' jobs to the SGE Q'ing system are a bit tricky due to the way early 'namd' is compiled - the specification of the worker nodes is provided by the 'charmrun' executable and some complicated additional files supplied with the 'namd' package. This means that 'namd2.7x' is more complicated to set up and run than 'namd2.8x'. The 'qsub' scripts are provided separately below.

namd 2.7x
^^^^^^^^^

namd 2.7x runs require the complicated mechanism described here. http://moo.nac.uci.edu/~hjm/bduc/namd2.7_sge_submit.sh[namd2.7_sge_submit.sh] is an SGE submission script that runs successfully on BDUC if given a valid 'namd' input file.
It must be submitted to SGE as follows:

---------------------------------------------------------
qsub -pe openmpi <#cores> namd2.7_sge_submit.sh
# or explicitly, for an 8core job
qsub -pe openmpi 8 namd2.7_sge_submit.sh
# or the '-pe' specification can be provided inside the file as well.
# See the example above.
---------------------------------------------------------

'(thanks to Chad Cantwell for the hints and the pointer to the http://www.ks.uiuc.edu/Training/Workshop/Cluster/files/using_rocks.html[original page])'

namd 2.8x
^^^^^^^^^

namd 2.8x uses a simpler calling convention, altho it also uses the 'charmrun' wrapper. The script above has been modified to use the namd 2.8x approach and is http://moo.nac.uci.edu/~hjm/bduc/namd2.8_sge_submit.sh[available here as namd2.8_sge_submit.sh]. Note that the final application call is much simpler. In most cases, you should use the 2.8x or later version. It is submitted to the SGE scheduler as above.

SATe
~~~~

.SATe is ONLY available on Claw nodes
[NOTE]
==================================================================================
Until we get a better sense of SATe popularity, it and its requisite tools are only available on the Claw nodes. You can log into claw1 directly ('ssh -Y your_UCINetID@bduc-claw1.nacs.uci.edu') and from there, to any of the other Claw nodes. If you are going to run a job that will take more than 10 minutes, we 'INSIST' that you run it under SGE so that the nodes don't get oversubscribed. How to write and submit an SGE script link:#SGE_batch_jobs[is described here]. Note that you will have to run on the 'claws' Q, ie your qsub script will have to include the SGE directive: *#$ -q claws*
==================================================================================

http://phylo.bio.ku.edu/software/sate/sate.html[SATe] is a Python wrapper around a number of Phylogenetic tools. It, along with its requisite tools (ftp://ftp.ebi.ac.uk/pub/software/clustalw2/[ClustalW2], http://align.bmr.kyushu-u.ac.jp/mafft/software/[MAFFT], http://www.drive5.com/muscle/[MUSCLE], http://opal.cs.arizona.edu/[OPAL], http://www.ebi.ac.uk/goldman-srv/prank/prank/[PRANK], http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm[RAxML]), is installed in the shared '/usr/local/bin' directory of the Claw nodes.

The test cases work with the default settings, but if you want to change any parameters, you have to edit the configuration file and feed it to 'run_sate.py' with '-c', as shown below.

-----------------------------------------
export SH=/usr/local/bin  # <- shortens the following lines considerably
run_sate.py -c sate.cfg -i $SH/sate_data/small.fasta -t $SH/sate_data/small.tree -j test
            ^^^^^^^^^^^
-----------------------------------------

You can name the configuration file anything you like, but it has a specific format. Especially, do 'NOT' try to start comments anywhere except the 1st character of a line, and then only beginning with '#'. Here is http://moo.nac.uci.edu/~hjm/sate.cfg[a good SATe configuration file] to start from. Here is http://moo.nac.uci.edu/~hjm/sate.bad.cfg[the same SATe configuration file with some bad comments] (marked as such with 'BAD COMMENT' in the offending lines). Here is http://moo.nac.uci.edu/~hjm/SATE.sh[an example qsub submission script for SATe]. Submit it to SGE as:

----------------------------------------------
qsub SATE.sh
----------------------------------------------

R on BDUC
~~~~~~~~~

http://www.r-project.org[R] is an object-oriented language for statistical computing, like SAS (see below).
It is becoming increasingly popular among both academic and commercial users, to the extent that it was http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html[noted in the New York Times] in early 2009. For a very simple overview with links to other, better resources, see http://moo.nac.uci.edu/~hjm/AnRCheatsheet.html[this link].

There are multiple versions of R on BDUC, and they do not all behave identically. We have a split cluster: most nodes (~80; ~160 cores) run http://www.centos.org[CentOS] (http://www.redhat.com[RedHat]-based), while the 4 claw nodes (16 cores) run a version of http://www.ubuntu.org[Ubuntu] (http://www.debian.org[Debian]-based). Because of slightly different library structures and versions, some R add-ons don't work across the subclusters, so in those situations we concentrate on getting the 'standard' approach working on the CentOS nodes and provide work-arounds on the claw nodes.

The module system provides R versions '2.10.0' and '2.8.0' for all nodes. Additionally, the claw nodes provide version '2.9.0', which is their default version. Finally, I've added the 'R development' version, which is automatically downloaded, compiled, and re-installed every night from the R archives. This is the VERY LATEST version, so new that it (infrequently) fails. However, if you need the latest and greatest version, it's available. To load any of these versions, find out what the available versions are with 'module avail' and then use the appropriate 'module load R/<version>' to set up the paths.

For most things, everything works identically. The things that don't usually have to do with parallel processing in R and the underlying http://en.wikipedia.org/wiki/Message_Passing_Interface[Message Passing Interface] (MPI) technology:

- http://cran.r-project.org/web/packages/Rmpi/index.html[Rmpi] should work on all CentOS nodes with version 2.10.0. The claw nodes will not work with the 2.10.0 version, as it has a complicated lib dependency that leads into some very bushy areas. Rmpi DOES work on the claw nodes, but only under R 2.9.0 (the default).
- http://cran.r-project.org/web/packages/rsprng[rsprng] (R's wrapping of http://sprng.cs.fsu.edu[SPRNG]) is available on all the CentOS nodes for R/2.10.0 and on the default 2.9.0 version on the claw nodes.
- http://cran.r-project.org/web/packages/snow/[snow] and http://cran.r-project.org/web/packages/snowfall/[snowfall] are available on the CentOS nodes with version 2.10.0 and on the claw nodes with the default 2.9.0 version.

[[sas93]]
SAS 9.3 for Linux
~~~~~~~~~~~~~~~~~

We have a single node-locked license for SAS 9.3 on claw1, a 4core Opteron node with 32GB RAM. While the license is for that node only, as many instances of SAS can be run as there is RAM to support them. To start SAS on claw1, first ssh directly to claw1:

-------------------------------------------------------
ssh -Y your_UCINetID@bduc-claw1.nacs.uci.edu
# then change directories (cd) to where your data is
cd /dir/holding/data
# and start SAS
sas
-------------------------------------------------------

This will start an X11 SAS session, opening several windows on your monitor (as long as you have an active X11 server running). If you're connecting from Mac or Windows, link:#graphics[please see this link]. You can use the SAS program editor (one of the windows that opens automatically), or use any other editor you want and paste or import that code into SAS.
The combination of http://www.gnu.org/software/emacs/[emacs] and http://ess.r-project.org/[ESS (Emacs Speaks Statistics)] is very powerful. It's mostly targeted to the R language, but it also supports SAS and Stata. http://www.nedit.org[Nedit] also has a http://www.nedit.org/ftp/contrib/highlighting/sas.1.0.pats[template file for SAS].

.To use Java (ods graphics)
[NOTE]
==================================================================================
SAS 9.3 uses Java for at least some of its plotting routines (the 'ods graphics'). The 64b version of SAS that we use on claw1 still uses the 32b version of Java, which needs the environment vars set to tell SAS where to find things, so if you are going to use SAS on claw1, please add the following to your '~/.bashrc' file:

-------------------------------------------------------
# convenience shortcut
export SASPATH=/home/apps/SAS-x86_64/9.3
# following is required to allow 32bit java to find its libs
export LD_LIBRARY_PATH=${SASPATH}/jre1.5.0_21/lib/i386:\
${SASPATH}/jre1.5.0_21/lib/i386/server:${LD_LIBRARY_PATH}
# Set JAVAHOME to the JRE root so that when SAS calls java, the right executable is called.
export JAVAHOME=${SASPATH}/jre1.5.0_21/
-------------------------------------------------------
==================================================================================

Parallel jobs
~~~~~~~~~~~~~

BDUC supports several http://en.wikipedia.org/wiki/Message_Passing_Interface[MPI] variants.

MPICH2
^^^^^^

BDUC is running 'mpich2-1.1.1p1'. Using it is not hard, but requires a few things:

- To compile MPI binaries, you'll have to link:#modules[module load] the MPICH2 environment:

----------------------------------------------------------------
module load mpich2
----------------------------------------------------------------

- You need to set up link:#passwordless_ssh[passwordless ssh] so that you can ssh to any BDUC node without entering a password, including editing your *~/.ssh/config* file to prevent 1st-time connection warnings from interrupting your jobs.
- You need to create the file *~/.mpd.conf*, as below:

[NOTE]
==================================================================================
From *Dec. 15th, 2009* onwards, the '.mpd.conf' is set up for you automatically when your account is activated, so you no longer have to do this manually. However, as a reference for those of you who want to set it up on other machines, I'll leave the documentation in place.
==================================================================================

----------------------------------------------------------------
cd
# replace 'thisismysecretpassword' with something random.
# You won't have to remember it.
echo "MPD_SECRETWORD=thisismysecretpassword" >.mpd.conf
chmod og-rw .mpd.conf
----------------------------------------------------------------

- Your mpich2 qsub scripts have to include the 2 following lines, in order to allow SGE to find the PATHs to executables and libraries:

----------------------------------------------------------------
module load mpich2
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"
----------------------------------------------------------------
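Putting the compile step and the submission together, here is a hedged sketch of the cycle; 'hello_mpi.c' and 'my_mpich2_job.sh' are hypothetical names:

----------------------------------------------------------------
$ module load mpich2                   # put the MPICH2 wrapper compilers on the PATH
$ mpicc -O2 -o hello_mpi hello_mpi.c   # compile with the MPI-aware wrapper compiler
$ qsub -pe mpich2 8 my_mpich2_job.sh   # ask SGE for 8 slots in the mpich2 parallel env
----------------------------------------------------------------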
[[mpich2script]]
A full MPICH2 script is shown below. Note the '#$ -pe mpich2 8' line, which sets up the MPICH2 parallel environment for SGE and requests 8 slots (CPUs). (See link:#SGE_script_params[above] for more SGE script parameters.)

----------------------------------------------------------------
#!/bin/bash
# good idea to be explicit about using /bin/bash (NOT /bin/sh).
# Some Linux distros symlink bash -> dash for a lighter weight
# shell, which works 99% of the time but causes unimaginable pain
# in those 1% occasions.
# Note that SGE directives are prefixed by '#$' and plain comments are prefixed by '#'.
# Text after the '<-' should be removed before executing.

#$ -q long                    <- the name of the Q you want to submit to
#$ -pe mpich2 8               <- load the mpich2 parallel env and ask for 8 slots
#$ -S /bin/bash               <- run the job under bash
#$ -M harry.mangalam@uci.edu  <- mail this guy ..
#$ -m bea                     <- .. when the script (b)egins, (e)nds, or (a)borts
#$ -N cells500                <- name of the job in the qstat output
#$ -o cells500.out            <- name of the output file.
#
module load mpich2            <- load the mpich2 environment
export MPD_CON_EXT="sge_$JOB_ID.$SGE_TASK_ID"  <- this is REQUIRED for SGE to set it up.
module load neuron            <- load another env (specific for 'neuron')
export NRNHOME=/apps/neuron/7.0  <- ditto
cd /home/hmangala/newmodel    <- cd to this dir before executing
echo "calling mpiexec now"    <- some debugging text
mpiexec -np 8 nrniv -mpi -nobanner -nogui /home/hmangala/newmodel/model-2.1.hoc
# above, start the job with 'mpiexec -np 8', followed by the executable command.
----------------------------------------------------------------

MATLAB
~~~~~~

MATLAB can be started from the login node by typing 'matlab'. This will log you into a 64bit interactive node and start the MATLAB Desktop. 'matlabbig' will start an interactive session on one of the claw nodes (32GB RAM). We have 3 licenses for interactive MATLAB on the BDUC cluster. Those licenses are decremented from the campus MATLAB license pool. They are meant for running interactive, relatively short-term MATLAB jobs, typically less than a couple hours. If they go longer than that, or if I see that you've launched several MATLAB jobs, they are liable to be killed off.

If you want to run long jobs using MATLAB code, the accepted practice is to compile your MATLAB '.m' code to a native executable using the MATLAB compiler 'mcc' and then submit that code, along with your data, to a batch Q (see above for submitting batch jobs). This approach does not require a MATLAB license, so you can run as many instances of this compiled code for as long as you want without impacting the campus licenses. The official mechanics of doing this http://tinyurl.com/nebw3e[are described here]. Some additional notes from someone who has done this link:#matlabcompiler[are in the Appendix].

MATLAB Alternatives
~~~~~~~~~~~~~~~~~~~

There are a number of MATLAB alternatives, the most popular of which are available on BDUC. Since these are Open Source, they aren't limited in the number of simultaneous uses, altho you should always try to run batch jobs in the SGE queue if possible. http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html#MathModel[See this doc for an overview of them and further links].

Hadoop
~~~~~~

http://hadoop.apache.org/[Hadoop] is a Java-based framework for running large-grained, parallel jobs on clusters. It now encompasses a large number of subprojects, but it is usually used with the http://hadoop.apache.org/mapreduce/[MapReduce] approach. It scales very well, but it is complex to run on BDUC because it requires its own filesystem and scheduler.
Since on BDUC (and other general-purpose clusters which are not dedicated to hadoop full-time) the job scheduling is more general-purpose, we have to run it as a meta-job. That is, you submit a request to SGE to allocate a number of nodes on which to run hadoop; SGE allocates them to hadoop; hadoop sets up the logical structures it needs on those allocated nodes, and everyone's happy. We run hadoop under http://myhadoop.sourceforge.net/[myHadoop], a small bit of middleware designed to handle the interactions between SGE and Hadoop. Note that Hadoop initializes its own filesystem on the existing /scratch directory of each node. Because Hadoop can end up storing many GBs of data, we have set up a dedicated hadoop SGE Q named (surprise) 'hadoop'. All the nodes in this Q should have >300GB available, and you can test this by running the following command on the login node:

--------------------------------------------------------------------
cf --config=/usr/local/bin/cfrc --target=HADOOP 'df -h |grep sda3 | scut --c1=3'
--------------------------------------------------------------------

This will start an interactive script that will create a subdir in the current directory, containing files named for all the hadoop nodes, each listing the free diskspace on that node's /scratch. You can see the results by doing this:

--------------------------------------------------------------------
cd REMOTE_CMD-df--h--grep-sda3---s-   # timestamp changes obviously
grep G *
a64-141:334G
a64-142:337G
..
..
--------------------------------------------------------------------

An SGE submit script for Hadoop http://moo.nac.uci.edu/~hjm/bduc/hadoop_example_qsub.sh[is here].

The usual way to exploit Hadoop is to write your application in Java, typically wrapping it into a 'jarfile'. This Java requirement is not absolute tho. You can also write your application in http://www.jython.org/[Jython] (Python written in Java) and therefore essentially write Java using Python. You can also write your hadoop app in pure Python (or even in C++). http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/[Here is a page describing that approach.]

The hadoop installation can be found at $HADOOP_HOME. By referring to this variable, you can, for example, add standard libraries to the class path of your java application (e.g. Class-Path: $HADOOP_HOME/lib/hadoop-0.20.2-tools.jar). (Thanks to Fabian Lindenberg for helping to set up and debug Hadoop on BDUC.)

GPUs
~~~~

Thanks to Dr. Steve Jenks, the 'claw8' node contains 2 Nvidia C1060 Graphics Processing Units (http://en.wikipedia.org/wiki/GPU[GPUs]), each with 240 cores. These cores are specialized to do http://en.wikipedia.org/wiki/SIMD[SIMD] tasks very fast. For example, if your code supports that kind of processing, you can get 10-100X speedups for those parts of the code. You can learn more about programming these GPUs with http://en.wikipedia.org/wiki/CUDA[CUDA] in the local docs or via the http://developer.nvidia.com/cuda-toolkit-40[more up-to-date docs at NVIDIA].

In order to use the GPUs, you will have to be registered to use the GPU SGE Queue (contact harry.mangalam@uci.edu) and of course will have to provide your own code, altho the entire set of Nvidia SDK examples is compiled and available locally: source code at '/apps/gpu/1.0/NVIDIA_GPU_Computing_SDK/C/src/', compiled binaries at '/apps/gpu/1.0/NVIDIA_GPU_Computing_SDK/C/bin/'. Of course, since only 'claw8' has the GPUs installed, you'll only be able to run them there.

In order to run the compiled GPU code for long runs, you'll have to submit them thru the SGE scheduler using the 'gpu' queue, which is initialized by using the *module load gpu* directive in your qsub script. We expect that you'll do your debugging on your own machine, altho you can do it on claw8 after you have registered and been added to the 'gpu' group. We currently have the CUDA Toolkit 4.0 installed and will try to remain current with NVIDIA upgrades.
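A minimal sketch of such a submission, assuming you've been registered for the 'gpu' Q; the binary name is hypothetical:

--------------------------------------------------------------------
#!/bin/bash
#$ -S /bin/bash
#$ -q gpu        # the GPU Q; requires registration (see above)
#$ -N gpu_run
#$ -cwd
module load gpu  # set up the CUDA paths
./my_cuda_app    # your own CUDA binary, running on claw8's C1060s
--------------------------------------------------------------------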
[[graphics]]
Graphics
--------

All the interactive nodes have the full set of X11 graphical tools and libraries. However, since you'll be running remotely, any application that requires OpenGL, while it will probably run, will run so slowly that you won't want to run it for long. If you have an application that requires OpenGL, you'll be much better off downloading the processed data to your own desktop and running the application locally.

If you connect using Linux
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to have access to these X11 tools via Linux, your local Linux must have the X11 libraries available. Unless you have explicitly excluded them, all modern Linux distros include X11 runtime libraries. Don't forget to use the '-Y' flag when you connect using ssh, to tunnel the X11 display back to your machine:

-----------------------------------------------------------------------
ssh -Y your_UCINetID@bduc.nacs.uci.edu
-----------------------------------------------------------------------

If you connect using MacOSX
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The MacOSX installation DVDs come with a free, Apple-certified X11 installation. On Leopard it's in *Optional Installs -> Optional Installs.mpkg*. All you have to do is install it and start it running in the background to accept the X11 windows (*Applications -> Utilities -> X11*). Ditto the '-Y' ssh flag as above.

[[XonWin]]
If you connect using Windows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are quite a few ways to use a Linux system besides logging into it directly from the console.

- remote shell access, using http://www.chiark.greenend.org.uk/~sgtatham/putty/[PuTTY], a free ssh client, which even allows X11 forwarding so that you can use it with Xming (below) to view graphical apps from BDUC. 'PuTTY' is a straight ssh terminal connection that allows you to securely connect to the Linux server and interact with it on a purely text basis. For shell/terminal cognoscenti, it's considerably less capable than any of the terminal apps (konsole, eterm, gnome-terminal, etc) that come with Linux, but it's fine for establishing the 1st connection to the Linux server. If you're going to run anything that requires an X11 GUI, you'll need to set PuTTY to do X11 forwarding. To enable this, double-click the PuTTY icon to bring up the PuTTY configuration window. In the left pane, follow the clickpath 'Connection -> SSH -> X11 -> set Enable X11 Forwarding'. After setting this, click on Session at the top of the pane, set a name in 'Saved Sessions' in the lower right pane, and click the [Save] button to save the connection information, so that the next time you need to connect, the correct settings will already be in place. You can customize PuTTY with a number of add-ons and config tweaks, http://www.thegeekstuff.com/2008/08/turbocharge-putty-with-12-powerful-add-ons-software-for-geeks-3/[some of which are described here.]
- http://sourceforge.net/projects/xming/[Xming], a lightweight and free X11 server (client, in normal terminology). Xming provides 'only the X server', as opposed to 'Cygwin/X' below.
Xming provides the X server that displays the X11 GUI information that comes from the Linux machine. When started, it looks like it has done nothing, but it has started a hidden X11 window (note the Xming icon in the toolbar). When you start an X application on the Linux server (after logging in with PuTTY as described above), it will accept a connection from the Linux machine and display the X11 app as a single window that looks very much like a normal MS WinXP window. You'll be able to move it around, minimize it, maximize it, and close it by clicking on the appropriate button in the title bar. There may be a slight lag in response in that window, but over the University network, it should be acceptable.

- if you have trouble setting up PuTTY and Xming, please see http://www.math.umn.edu/systems_guide/putty_xwin32.html[this page, which describes it in more detail, with screenshots]
- http://x.cygwin.com/[Cygwin/X], another free, but much larger and more capable X server (combined with an entire Linux-on-Windows implementation). Provides much more power and requires much more user configuration than Xming. Cygwin/X provides not only a free X server but nearly the entire Linux experience on Windows. This is more than what most normal users want (both in diskspace and configuration), especially if you have a real Linux server to use. The X11 server is very good tho, as you might expect.
- http://www.realvnc.com/[VNC server and client]. Run the server on the Linux machine and connect to it with the client running on your Windows Desktop. Can provide the entire Linux Desktop experience on your Windows machine, altho with less graphics performance (it's fine to connect to a machine on the university network, but slow across the Internet). VNC is a mechanism that can present the entire Linux Desktop to the user, including not only the application windows but the Desktop itself, with all the bells and whistles that that metaphor provides. The RealVNC package for Windows provides both the Viewer and the Server, so you can provide remote access to your Windows Desktop as well. This can be especially useful if you're trying to demo a Desktop application to others - you can configure the VNC server to allow multiple read-only clients (they can't take control of your desktop) to watch you run the app. Combined with the multiplatform VOIP application Skype, which can run on the same machine, you have a very cheap tele-screensharing setup, good for demo'ing applications. The Windows VNC server is efficient enough to support at least 10 viewing clients, and the refresh rate is good enough for a mostly 2D demo across UC.
- http://nomachine.com/[NoMachine] http://www.nomachine.com/download.php[Server and Clients], a system much like the VNC system, but much more efficient and therefore with better performance. It is also more complicated to set up. Please read http://www.linux.com/archive/feature/116354[this review] for an overview of what is required and how to install it. For personal use, there is a free server and client. For more connections, you'll need a commercial license. There are also 2 free NX servers: http://freenx.berlios.de/[FreeNX] and Google's recently released http://code.google.com/p/neatx/[NeatX], both of which are fairly easy to install and allow unlimited connections. BDUC uses both free versions, as described in more detail immediately below.

[[nomachine]]
NoMachine NX connections
~~~~~~~~~~~~~~~~~~~~~~~~

We've added GPL'ed NX servers to the 'claw1' and 'claw6' nodes.
Both have direct external connections. With the 'nxclient' software installed on your machine, you can run the entire KDE or GNOME Desktop with remarkable speed. Here's a screenshot of my laptop screen with the nxclient Desktop from claw1 running SAS, matlab, and tablet (a genomics assembly viewer):

image:nxclient_desktop_ss.png[nxclient desktop shot]

Get the appropriate client software for your platform http://www.nomachine.com/download.php[here].

Configuring the nxclient
^^^^^^^^^^^^^^^^^^^^^^^^

The configuration is fairly simple. The initial pane allows you to set your 'Login' (your UCINetID) and 'Password' (your UCINetID password), and name the session anything you like.

image:nxclient_screen1.png[nxclient screen1]

Clicking the 'Configure...' button takes you to a set of tabbed configuration pages. The only one that needs to be modified is the 1st one, 'General':

image:nxclient_config_general_kde.png[nxclient general kde config]

The screenshot above shows the setup for logging into the claw1 node (which supports the KDE Desktop). If you want to use it on the login node, which supports the GNOME Desktop, see below:

image:nxclient_config_general_gnome.png[nxclient general gnome config]

'DO NOT' change the default Key unless you have problems logging into the 'login' node.

.Key Changes with 'bduc-login'
[NOTE]
==================================================================================
When bduc-login was upgraded, the ssh keys used to validate the 'nx' user (who inits the nxserver) changed. If you use the nxclient with bduc-login, you'll have to change the nxclient key to this one (also on the bduc-login node in '/usr/local/share/nxclient_dsa_key'):

-----------------------------------------------------------------------------
-----BEGIN DSA PRIVATE KEY-----
MIIBuwIBAAKBgQCXv9AzQXjxvXWC1qu3CdEqskX9YomTfyG865gb4D02ZwWuRU/9
C3I9/bEWLdaWgJYXIcFJsMCIkmWjjeSZyTmeoypI1iLifTHUxn3b7WNWi8AzKcVF
aBsBGiljsop9NiD1mEpA0G+nHHrhvTXz7pUvYrsrXcdMyM6rxqn77nbbnwIVALCi
xFdHZADw5KAVZI7r6QatEkqLAoGBAI4L1TQGFkq5xQ/nIIciW8setAAIyrcWdK/z
5/ZPeELdq70KDJxoLf81NL/8uIc4PoNyTRJjtT3R4f8Az1TsZWeh2+ReCEJxDWgG
fbk2YhRqoQTtXPFsI4qvzBWct42WonWqyyb1bPBHk+JmXFscJu5yFQ+JUVNsENpY
+Gkz3HqTAoGANlgcCuA4wrC+3Cic9CFkqiwO/Rn1vk8dvGuEQqFJ6f6LVfPfRTfa
QU7TGVLk2CzY4dasrwxJ1f6FsT8DHTNGnxELPKRuLstGrFY/PR7KeafeFZDf+fJ3
mbX5nxrld3wi5titTnX+8s4IKv29HJguPvOK/SI7cjzA+SqNfD7qEo8CFDIm1xRf
8xAPsSKs6yZ6j1FNklfu
-----END DSA PRIVATE KEY-----
-----------------------------------------------------------------------------

On your nxclient, the click-path is: [Configure] -> [Key] -> (delete current key) -> (paste in the contents of the above text box, including the BEGIN & END lines). Then [Save] -> [Save] -> [OK] -> [Login]
==================================================================================

Unlike the commercial NoMachine NXserver, these servers allow any number of connections. There may be a fairly long wait (up to a minute) before the session is initially validated and the screen comes up (the Desktop is loading on the server), but after that, the interaction is very fast.

Terminating the session
^^^^^^^^^^^^^^^^^^^^^^^

Note that when you close the session, you have 2 options: 'Disconnect' (closes the client but leaves the session running, so you can reconnect to the same session you left) or 'Terminate' (closes the client and kills the session, so you'll start from a new Desktop instance). Unless there is good reason to keep it running, please *Terminate the session* to free up resources.
If you're running a long-running process, you can *Disconnect* so that you'll be able to *Re-connect* later, but remember to *Terminate* when it's finished.

I have run into a situation whereby, if the session is ended oddly (ie, you kill the nxclient by killing the shell from which it was started), it will leave a 'ghost session'. When you next start up the nxclient, it will offer to let you re-connect to an existing session, but then be unable to reconnect. If this happens, you should still be able to start a 'new' session, but please call me to address that situation - I have to manually remove the ghost session credentials in '/usr/local/var/lib/neatx/sessions/' or '/var/lib/neatx/sessions', depending on how it was set up. If you cannot establish a reconnection or a new connection, you may have too many ghost sessions, and in that case, you should attempt to clean house by killing all your nx sessions:

---------------------------------------------------------------------
# find your nx processes
# the search term 'n[x]' prevents ps from finding the grep process itself
$ ps aux |grep n[x] |grep hmangal

hmangala  9646  0.0  0.0  92260  3952 pts/18 Ss+ 11:43 0:00 /bin/su hmangala -c cd && /usr/local/lib/neatx/nxserver --proto=3030000 -- hmangala
hmangala  9670  0.0  0.0  10824  1596 pts/18 S+  11:44 0:00 bash -c cd && /usr/local/lib/neatx/nxserver --proto=3030000 -- hmangala
hmangala  9711  0.4  0.0  58176 10260 pts/18 S+  11:44 0:00 /usr/bin/python /usr/local/lib/neatx/nxserver --proto=3030000 -- hmangala
hmangala 20600  0.0  0.0  10748  1500 ?      S   Aug03 0:00 /bin/bash /usr/local/lib/neatx/nxnode-wrapper hmangala DABD5CA430357900E0F012B7543A8034
hmangala 20603  0.0  0.0  70676 10696 ?      Sl  Aug03 0:00 /usr/bin/python /usr/local/lib/neatx/nxnode hmangala DABD5CA430357900E0F012B7543A8034
hmangala 20604  0.0  0.0  10748   708 ?      S   Aug03 0:00 /bin/bash /usr/local/lib/neatx/nxnode-wrapper hmangala DABD5CA430357900E0F012B7543A8034
hmangala 20608  0.0  0.2 150696 67312 ?      S   Aug03 1:45 /usr/bin/nxagent -D -name Neatx - hmangala@claw1.bduc:430 - claw1 -options /usr/local/var/lib/neatx/sessions/DABD5CA430357900E0F012B7543A8034/options -nolisten tcp :430

#         ^^^^^ the PIDs you have to kill
# and kill them off
$ kill -9 9646 9670 9711 20600 20603 20604 20608

# now try to establish an nx connection.
---------------------------------------------------------------------

.Further oddities with 'nxclient'
[NOTE]
==================================================================================
In the myriad configurations possible, there are a few other notes that may be useful. If you can use the nxclient to log into a system and it appears to accept mouse input but no longer accepts keyboard input, you may need to define the 'XKEYSYMDB' environment variable to explicitly point to the BDUC node's 'XKeysymDB' file. Add the following line to your '~/.bashrc' file:

---------------------------------------------------------------------
export XKEYSYMDB=/usr/share/X11/XKeysymDB
---------------------------------------------------------------------

Then kill off all your nxclient sessions and start the nxclient again. Keyboard input should work. This also addresses similar errors in older applications that use 'Motif' widgets (nedit, others). If it still does not work, you may have the wrong keyboard selected. On bduc-login (GNOME Desktop), the way to select the correct keyboard is shown in the following images.
First, navigate to the Keyboard selection option (below left):

image:images/keyboard_select.png[Keyboard Option Path]
image:images/keyboard_select2.png[Keyboard selection]

and then select the closest keyboard using the scrolling list (above, right).

If you start getting nx errors that result in a blank/black screen with a pop-up box that claims to be unable to complete: '/bin/bash -c "/etc/X11/xinit/Xsession gnome-session"', you may have damaged your '~/.Xauthority' and/or '~/.ICEauthority' files. The simple fix is to move them out of the way and then ssh into the node to re-create them.
==================================================================================

How to Manipulate Data on Linux
-------------------------------

This is a topic for a whole 'nother document named http://moo.nac.uci.edu/~hjm/ManipulatingDataOnLinux.html[Manipulating Data on Linux] and the documents and sites referred to therein.

Frequently Asked Questions
--------------------------

OK, maybe not frequently, but cogently, and CAQ just doesn't have the same ring. If you have other questions, please ask them. If they address a frequent theme, I'll add them here. In any case, I'll try to answer them.

[qanda]
.Q&A

What's a node? Is it the same as a processor? ::
A node refers to a self-contained chassis that has its own power supply and motherboard (containing RAM, CPU, controllers, IO slots and devices (like ethernet ports), various wires, and unidentifiable electrogrunge). It usually contains a disk, altho this is not necessary with boot-over-the-network. It's not the same as a processor. Typical BDUC nodes (from the Jurassic period) have 2-4 CPU cores per node. Modern nodes have 8 to >100 cores.

When I submit a .sh script with qsub, does the following line refer to 10 processors or 10 nodes? *#$ -pe openmpi 10* ::
10 processor cores. Most modern physical CPUs (the thing that plugs into the motherboard socket) contain multiple processor cores.

What about the call to mpiexec? *mpiexec -np 10 nrniv -mpi -nobanner -nogui modelbal.hoc* ::
Same thing. That's why they should be the same number.

Is it possible for the processors on one node to be working on different jobs? ::
Yes, altho the scheduler can be told to try to keep the jobs on 1 node (better for sharing memory objects like libs, but worse if there's significant contention for other resources like disk & network IO). Most of the MPI environments on BDUC are currently set to spread out the jobs rather than bunch them together on as few nodes as possible.

If processor 1 (working on Job A) fails, does it bring down processor 2 (working on Job B) as well? ::
No, and in fact it doesn't typically work that way. A job does not run on a particular CPU; on a multi-core node, different threads of the same job can hop among CPU cores. The kernel allocates threads and processes to whatever resources it has, to optimize the job.

Is the performance of processor 1 dependent on whether processor 2 is engaged in the same or a different job? ::
It depends. The computational bits of a thread, when they are being executed on a CPU core, don't interfere much with the other cores. However, the cores do share memory, interrupts, and IO, so if they're doing roughly the same thing at roughly the same time, they'll typically want to read and write at the same time and thus compete for those resources. That was the rationale for 'spreading out' the MPI jobs rather than 'filling up' nodes.
Is it possible for one processor to use more than its "share" of the memory available to the node, i.e., is it wrong for me to count on having a certain amount of memory just because I've specified a certain number of processors (nodes?) for my job? ::
The CPU running prog1 will request the RAM that it needs independently of other CPUs running prog1 or prog2, prog3, etc. If the node gets close to running out of real RAM, it will start to swap idle (haven't-been-accessed-recently) pages of RAM to the disk, freeing up more RAM for active programs. If the computer runs out of both RAM and swap, it will hopefully kill off the offending programs until it regains enough RAM to function, and then it will continue until it happens again. This is why you should try to estimate the amount of RAM your prog will use and indicate that to the scheduler with the '-l mem_free' directive. See link:#SGE_script_params[the section above.]

I can ssh to BDUC but I can't scp files to it. Why? ::
Probably because you edited your '.bashrc' (or '.zshrc' or '.tcshrc') to emit something useful when you log in. (Both scp and ssh have a useful option, '-v', that puts them into 'verbose' mode, which tells you much more about what the process is doing and why it fails.) You need to mask this output from non-interactive logins like 'scp' and remote 'ssh' execution by placing such commands inside a *test for an interactive shell*. When using bash, you would typically do something like this:

-------------------------------------------------------------------
interactive=`echo $- | grep -c i `
if [ ${interactive} = 1 ] ; then
  # tell me what my 22 latest files are
  ls -lt | head -22
fi
-------------------------------------------------------------------

Appendix
--------

[[clustercomputing]]
Cluster Computing
-----------------

What is a cluster?
~~~~~~~~~~~~~~~~~~

A compute cluster is typically composed of a pool of computers (aka nodes) that allow users (and there are usually several to several hundred simultaneous users) to spread compute jobs over them in a way that allows the maximum number of jobs to be matched to the number of computers. The cluster is often composed of specialized login nodes, compute nodes, storage nodes, and specialty nodes (eg, a large-memory node, a GPU node, an FPGA node, a database server node, etc).

The BDUC cluster consists of about 100 computers, each of which has 2-64 64bit CPU cores and 8-256GB RAM. All these nodes have a small amount of local storage (filesystems (fs) directly connected to the node) that holds the Operating System, a few utilities, and some scratch space (in /scratch). Some nodes have considerably larger local storage to provide more storage for a specific application or to the research group that bought it.

All the nodes communicate with each other over a private 1Gb/s ethernet network, via a few central switches. This means that each node can communicate at almost 100MB/s total bandwidth with all the other nodes, but there are bottlenecks at the switches and at frequently used nodes, such as the login node and the main storage nodes.

The login node
~~~~~~~~~~~~~~

In BDUC, the login node and the main storage node are unfortunately the same machine, leading to a bandwidth and CPU bottleneck at the login node. This is one main reason to consider staging data elsewhere (on the remote nodes' '/scratch' storage or on the '/gl' filesystem) for jobs that require large amounts of data I/O. See the note on link:#stagingdata[staging data above].
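As a minimal sketch of that approach (the filenames, paths, and application here are hypothetical), a data-staging SGE script might look like this:

---------------------------------------------------------------------
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# hypothetical example: stage a large input file to node-local /scratch,
# compute against the local copy, then copy only the results back
mkdir -p /scratch/$USER/myjob
cp /gl/$USER/big_input.dat /scratch/$USER/myjob/
cd /scratch/$USER/myjob

# 'some_app' stands in for your actual application
some_app --input=big_input.dat --output=big_output.dat

# return the results to shared storage and clean up the scratch space
cp big_output.dat /gl/$USER/
rm -rf /scratch/$USER/myjob
---------------------------------------------------------------------

Staged this way, the heavy reads and writes hit the compute node's local disk rather than the login node's storage.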
The CPU bottleneck is important because servicing disk read and write requests requires CPU attention in the same way that an application does. BDUC's 100-odd compute nodes are busy consuming or producing data, and quite a lot of that input and output is directed at the login node's storage. So if someone is trying to run a compute-intensive application on the login node (especially a multicore application), it leaves very little CPU time to handle the disk requests. The Operating System is fairly robust in that it will queue such requests quite deep in order to buffer them, but if the CPU cannot keep up with the requests, eventually it will run out of queue depth and the system will lock up.

*SO, DO NOT RUN APPLICATIONS ON THE LOGIN NODE.*

It's fine to edit your program or data files (as long as they're under a few MB), and even compile your programs on the login node. You can also test your programs as long as they run for very short periods - a few minutes at most. If you need to run your programs interactively, request an interactive node via SGE or ssh directly to one of them (a64-00[12], claw[1-5]).

[[homevsgl]]
The difference between '/home' and '/gl'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As noted above, the original main storage system for BDUC is the '/home' filesystem, provided by the bduc-login node. The '/home' filesystem is a RAID6 of 15x1TB disks. 'RAID6' means that it can lose 2 disks before it loses any data. However, if more than 2 disks are lost, ALL data will be lost. It has been supplemented by the '/gl' filesystem, which is a distributed gluster fs. On the gluster fs, data is spread file-wise over 8 RAID6s on 4 different servers, each of which hosts 1/4 of the files, so even if a whole node is destroyed, 3/4 of the files will survive. This may be of little comfort if the one file you need is among the lost, but that's why we repeat the mantra 'Back up your files if they are of value.'

The *Strongly Suggested* approach is to put your code and small intermediate analyses on '/home' and keep your large data and intermediate files on '/gl'. In this way, you'll be able to search thru your files quickly, but when you submit large jobs to the cluster via SGE, they won't bog down the 'login' node, nor will they interfere with other cluster jobs, since '/gl' is a distributed FS (and on average, only 1/4 of the cluster nodes will be interacting with the same storage node as your job). In other words, it scales well.

[[writeperfongl]]
Write performance on '/gl', especially with many small writes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

IMPORTANT: *If you do not understand the following, please contact an admin to explain it; it's important.*

Unlike the http://en.wikipedia.org/wiki/Network_File_System[Network File System] (NFS) provided by '/home', the '/gl' filesystem (fs) has some odd quirks that can either destroy performance or be used to greatly enhance it. If you have written your own code or are using an app that writes zillions of tiny chunks of data to 'STDOUT', and you are storing the results on the '/gl' fs, you should consider passing the output thru 'gzip' to consolidate the writes into a continuous stream. If you don't do this, each write will be considered a separate IO event and the performance on the '/gl' fs will be .. well .. atrocious. We recently ran into this with the 'bedtools' utility 'genomeCoverageBed'. It read in a multi-GB file quickly but then choked while writing the ~2GB output file, increasing the runtime by >30X.
If, however, the STDOUT is passed thru 'gzip', the wallclock runtime decreases even below the usual runtime, and you end up with an output file that is already compressed to about 1/5 the usual size. Here's how to do it, using both the explicit example we ran into and a generic one:

--------------------------------------------------------------------------------
# all the files are on the /gl fs
module load bedtools/2.15.0
...
cd /gl/where/your/files/are

# this is the usual (but incredibly slow) way; note the uncompressed output
genomeCoverageBed -ibam RS_11261.bam -g ref/dmel-all-chromosome-r5.1.fasta -d > output.cov

# this is the MUCH faster way, even tho you're using more CPU time
# (gzip is in another process using another CPU)
genomeCoverageBed -ibam RS_11261.bam -g ref/dmel-all-chromosome-r5.1.fasta -d |gzip > output.cov.gz
# .............................................................................^^^^^^

# the generic example:
someapp --opt1 --opt2 --input=/gl/me/input_file | gzip > /gl/me/output_file.gz
--------------------------------------------------------------------------------

[[namedpipes]]
Use *named pipes* instead of files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note that the gzip trick mentioned above will only work with applications that write their data to 'STDOUT'. If the app explicitly writes to a file, then you can use a http://en.wikipedia.org/wiki/Named_pipe[named pipe] (aka 'fifo' - first in, first out) to capture and process the output.

--------------------------------------------------------------------------------
mkfifo /gl/me/npipe   # create a named pipe

# use the named pipe as the output file and send the process into the background with '&'
some_app --input=/gl/me/gigantic_input.fq.gz --out=/gl/me/npipe --match=33 --kmer=64 ... &

# now read from the named pipe and gzip it in one go
gzip < /gl/me/npipe > /gl/me/some_app.out.gz

# and finally delete the named pipe, just as you would a file.
rm /gl/me/npipe
--------------------------------------------------------------------------------

Some words about Big Data
~~~~~~~~~~~~~~~~~~~~~~~~~

To new users, especially to users who have never done BIG DATA work before: understand what it is you're trying to do and what that means to the system. Consider the size of your data, the pipes that you're trying to force it thru, and what analyses you're trying to perform on it. It should not be necessary to posit this, but there are clearly users who don't understand it. There is a '1000-fold difference' between each of these:

- 1,000 bytes, a KILOBYTE (KB) ~ an email
- 1,000,000 bytes, a MEGABYTE (MB) ~ a PhD thesis
- 1,000,000,000 bytes, a GIGABYTE (GB) ~ 30X the 10-volume 'The Story of Civilization'
- 1,000,000,000,000 bytes, a TERABYTE (TB) ~ 1/10 of the text content of the Library of Congress
- 1,000,000,000,000,000 bytes, a PETABYTE (PB) ~ 100X the text content of the Library of Congress

BDUC has about 30TB of storage on '/gl' to be shared among 400 users, and the instantaneous needs of those users vary tremendously. We do not use disk quotas to enforce user limits, in order to allow substantial dynamic storage use. However, if you use hundreds of GB, the onus is on you to clean up your files and decrease that usage as soon as you're done with it.

1 Big File vs Zillions of Tiny Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This subject - arcane as it might seem - is important enough to merit its own subsection. Because BDUC is community infrastructure, efficient use of its resources is important.
Let me try this analogy: driving a car across the country in 1st gear is possible, but it's not efficient, wastes time and energy, and will decrease the life of the engine. Using Zillions Of Tiny files (ZOTfiles) is a similarly destructive practice.

A tiny file by itself is no more inefficient than a huge one. If you have only 100 bytes to store, store it in a single file. However, the problems start compounding when there are many of them. Because of the way data is stored on disk, 10MB stored in ZOTfiles of 100 bytes each can easily take up NOT 10MB, but more than 400MB - 40 times more space. Worse, data stored in this manner makes many operations very slow - instead of looking up 1 directory entry, the OS has to look up 100,000. This means 100,000 times more disk head movement, with a concomitant decrease in performance and disk lifetime.

If you are writing your own utilities, whether in Perl, C, Java, or Haskell, please use efficient data storage techniques: minimally, indexed file appending; preferably 'real' data storage such as binary formats, http://www.hdfgroup.org/HDF5/[HDF5] and http://www.unidata.ucar.edu/software/netcdf/[netCDF]. And don't forget about in-memory data compression, for example via the excellent free http://zlib.net/[zlib library] or language-specific libraries that use compression, such as:

------------------------------------------------------------------------------------
libio-compress-perl - bundle of IO::Compress modules
python-snappy - Python library for the snappy compression library from Google
------------------------------------------------------------------------------------

If you are using someone else's analytical tools and you find they are writing ZOTfiles, ask them, 'plead with them', to fix this problem. Despite the sophistication of the routines that may be in the tools, it is a mark of a poor programmer to continue this practice.

[[HowtoPasswordlessSsh]]
HOWTO: Passwordless ssh
~~~~~~~~~~~~~~~~~~~~~~~

'Passwordless ssh' will allow you to ssh/scp to frequently used hosts without entering a passphrase each time. *The process below works on Linux and Mac only*. Windows clients can do it as well, but it's a different procedure. However, regardless of your desktop machine, you can use passwordless ssh to log in to all the nodes of the BDUC cluster once you've logged into the login node.

.Note for BDUC Parallel / MPICH2 Users
***************************************************
If you're going to be using some variant of MPI (MPICH, MPICH2, OpenMPI) or another parallel toolkit, you almost certainly will have to set this up so you (or your scripts) can passwordlessly ssh to other BDUC nodes. For BDUC users using only serial programs, it can still be useful, as it cuts down on the amount of password-typing you'll have to do. And it's dead simple.
***************************************************

In a terminal on your Mac or Linux machine, type:

-----------------------------------------------------------------------------
# for no passphrase, use
ssh-keygen -b 1024 -N ""

# if you want to use a passphrase:
ssh-keygen -b 1024 -N "your passphrase"
# but you probably /don't/ want a passphrase - else why would you be going thru this?
-----------------------------------------------------------------------------

Save to the default places.
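As a quick sanity check (assuming you accepted the default RSA filenames), the new keypair should now be visible in your '~/.ssh' dir:

-------------------------------------------------------
ls -l ~/.ssh/id_rsa ~/.ssh/id_rsa.pub
# the private key (id_rsa) should be readable only by you, ie: -rw-------
-------------------------------------------------------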
*For the BDUC cluster case:* Since all cluster nodes share a common */home*, all you have to do is rename the public key file (normally *id_rsa.pub* in your ~/.ssh dir) to *authorized_keys*, eg 'cd ~/.ssh; cp id_rsa.pub authorized_keys'.

*For unrelated (non-cluster) hosts:* 'Linux users', use the 'ssh-copy-id' command, included as part of your ssh distribution. ('Mac users' will have to do it manually, as described just below.) 'ssh-copy-id' does all the copying in one shot, using your *~/.ssh/id_rsa.pub* key by default (use the -i option to specify another identity file, say *~/.ssh/id_dsa.pub*, if you're using DSA keys).

-------------------------------------------------------
ssh-copy-id your_bduc_login@bduc.nacs.uci.edu
# you'll have to enter your password one last time to get it there.
-------------------------------------------------------

What this does is scp *id_rsa.pub* to the remote host (the ssh server you're trying to log into) and append that key to the remote file *~/.ssh/authorized_keys*. If things don't work, check that the *id_rsa.pub* file has been appended correctly. Then verify that it's worked by ssh'ing to BDUC. You shouldn't have to enter a password anymore.

*For Mac users*, scp the same keys to the remote host and append your public key to the remote *~/.ssh/authorized_keys*. Here are the commands; just modify the UCINETID value and mouse them into the *Terminal* window on your local Mac.

-------------------------------------------------------
bash                 # starts the bash shell just to make sure the rest of the commands work
cd                   # makes sure you're in your local home dir
export UCINETID=""   # fill in the empty quotes with *your UCINetID*
# you'll need to enter the password manually for the next 2 commands
scp ~/.ssh/id_rsa.pub ${UCINETID}@bduc-login.nacs.uci.edu:~/.ssh/id_rsa.pub
ssh ${UCINETID}@bduc-login.nacs.uci.edu 'cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys'
# and now you should be able to ssh in without a password
ssh ${UCINETID}@bduc-login.nacs.uci.edu
-------------------------------------------------------

.First time challenge from ssh
*******************************************************************
If this is the 1st time you're connecting to BDUC from your Mac (or PC), you'll get a challenge like this:

-------------------------------------------------------
The authenticity of host 'bduc-login.nacs.uci.edu (128.200.15.20)' can't be established.
RSA key fingerprint is 57:70:23:8e:e1:15:8c:51:b0:52:ca:c7:a8:e9:26:9b.
Are you sure you want to continue connecting (yes/no)?
-------------------------------------------------------

and you have to type 'yes'. For MPI / Parallel users, you should set up a local *~/.ssh/config* file to tell ssh to ignore such requests. The file should contain:

-------------------------------------------------------
Host *
    StrictHostKeyChecking no
-------------------------------------------------------

and must be chmod'ed to be readable only by you, ie:

-------------------------------------------------------
chmod go-rw ~/.ssh/config
-------------------------------------------------------
*******************************************************************

[[matlabcompiler]]
Notes on using the MATLAB compiler on the BDUC cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(Thanks to 'Michael Vershinin' and 'Fan Wang' for their help and patience in debugging this procedure.)

As noted, the official docs for compiling your MATLAB code are http://tinyurl.com/nebw3e[described here].
Before you start hurling your '.m' code at the compiler, please read the following for some hints. The following is a simple case where all the MATLAB code is in a single file, say 'test.m'. Note that for the easiest path, you should write your MATLAB code to compile as a function. This means that the keyword 'function' has to be used to define the MATLAB code (link:#matlab_compile_example[see example below]). If you want to pass parameters to the function, you have to include a function parameter indicating this.

---------------------------------------------------------------------
# Before you use any MATLAB utilities, you will have to load the
# MATLAB environment via the 'module' command
module load matlab/R2009b

# for a C file dependency, you compile it with 'mex'. Note that mex doesn't like
# C++ style comments (//), so you'll have to change them to the C style /* comment */
mex some_C_code.c
# -> produces 'some_C_code.mexa64'

# then compile the MATLAB code for a standalone application.
# (type mcc -? for all mcc options)
# If the m-code has a C file dependency which has already been mex-compiled,
# mcc will detect the requirement and link the '.mexa64' file automatically.
mcc -m test.m
# -> 'test' (can take a minute or more)

# !! if you have additional files that are dependencies, you may have to define
# !! them via the '-I /path/to/dir' flags to describe the dirs where your
# !! additional m code resides.

# for a _C_ shared lib (named libmymatlib.so) with multiple input .m files
mcc -B csharedlib:libmymatlib file1.m file2.m file3.m

# for a _C++_ shared lib (named libmymatlib.so) with multiple input .m files
mcc -B cpplib:libmymatlib file1.m file2.m file3.m
---------------------------------------------------------------------

[[passingvars]]
Passing variables to compiled MATLAB applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Few programs will be useful with all their variables compiled statically. There are a few ways to pass variables to the program. The easiest, for a single variable or a few variables, is to use the http://www.mathworks.com/help/techdoc/ref/input.html[MATLAB 'input' function] to read in a character, string, or vector and process it internally to provide the required variables. Another way, especially if you have a large number of variables to pass, is to 'include the variables in a file' and feed that file to the MATLAB app. This requires that the MATLAB app be designed to read a file and parse it correctly. Both are described in some detail in the official MATLAB documentation http://www.mathworks.com/help/toolbox/compiler/f13-1005831.html#f13-1006802[Passing Arguments to and from a Standalone Application]. More examples are described http://its.virginia.edu/research/matlab/compiler.html#Example[here, in the example *function matlab_sim()*] and in the text following.

Files produced by the mcc compiler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the 'standalone' case, which will probably be the most popular approach on BDUC, the mcc compilation will generate a number of files:

---------------------------------------------------------------------
readme.txt ............... autogen'd description of the process
test ..................... the 'semi-executable'
test.m ................... original 'm code'
test_main.c .............. C code wrapper for the converted m code
test_mcc_component_data.c . m code translated into C code
run_test.sh .............. the script that wraps and runs the executable
test.prj ................. XML description of the entire compilation dependencies (Project file)
---------------------------------------------------------------------
In order to run the executable, you often can't submit the auto-generated 'run_test.sh' directly to the SGE Q. You have to submit it wrapped in an SGE script which calls the 'run_test.sh' script, which in turn sets up all the necessary environment variables and paths to run the executable. You can test it for a few minutes like this:

---------------------------------------------------------------------
./run_test.sh [matlab_root] ./test
# where the [matlab_root] would be '/apps/matlab/r2009b' for the
# matlab version that supports the compiler
---------------------------------------------------------------------

Note that if you have already loaded the MATLAB module, you can usually run the compiled executable alone from the commandline. However, for long/production runs, you will have to create a bash script (call it 'runmycode.sh') like this:

---------------------------------------------------------------------
#!/bin/bash
#$ -S /bin/bash           # run with this shell
#$ -N comp_matlab_run     # this name shows in qstat
#$ -q long                # run in this Q
#$ -l h_rt=50:00:00       # need 50 hour runtime
#$ -l mem_free=2G         # need 2GB free RAM
#$ -l scr_free=1G         # need 1 GB scratch space
#$ -cwd                   # run the job out of the current directory
                          # (the one from which you ran the script)
#$ -notify
#$ -M your_address@uci.edu  # send mail about this job to this address (replace with yours)
#$ -m beas                # send a mail to the owner when the job
                          # begins (b), ends (e), is aborted (a),
                          # or is suspended (s)

./run_test.sh /apps/matlab/r2009b ./test
---------------------------------------------------------------------

and qsub it to SGE:

---------------------------------------------------------------------
qsub runmycode.sh
---------------------------------------------------------------------

[[matlab_compile_example]]
MATLAB Compilation Example
^^^^^^^^^^^^^^^^^^^^^^^^^^

Below is a very simple example showing how to compile and execute some MATLAB code. Save the following code to a file named 'average.m'.

---------------------------------------------------------------------
function y = average(x)
% AVERAGE Mean of vector elements.
% AVERAGE(X) is the mean of the vector elements, where X is a vector.
% Nonvector input results in an error.
[m,n] = size(x);
if (~((m == 1) | (n == 1)) | (m == 1 & n == 1))
    error('Input must be a vector')
end
y = sum(x)/length(x);   % Actual computation
y
---------------------------------------------------------------------

Once the code is saved as 'average.m', compile and run it by copying and pasting the following into a terminal window:

---------------------------------------------------------------------
module load matlab/R2009b   # load the MATLAB environment
mcc -m average.m            # compile the code (takes many seconds)
z=1:99                      # assign the input range to a shell variable
./average $z                # call the executable with the range
---------------------------------------------------------------------

Note also that if you're going to run this under SGE as multiple instances, each instance will have to run with the appropriate MATLAB environment, so you will have to preface each exec with the 'module load matlab/R2009b' directive.

[[missinglibs]]
Resolving Missing Libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many of the problems we hear about are due to missing or incompatible library dependencies.
A complicated program (like R) has many such dependencies:

----------------------------------------------------------------------------
$ ldd libR.so
        linux-vdso.so.1 =>  (0x00007fff003fc000)
        libblas.so.3 => /usr/lib64/libblas.so.3 (0x00002b83c1c32000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00002b83c1e88000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b83c217c000)
        libreadline.so.5 => /apps/readline/5.2/lib/libreadline.so.5 (0x00002b83c23ff000)
        libncurses.so.5 => /usr/lib64/libncurses.so.5 (0x00002b83c263c000)
        libz.so.1 => /usr/NX/lib/libz.so.1 (0x00002b83c2899000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b83c29ad000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b83c2bb7000)
        libfunky.so.2 => not found
        libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00002b83c2dbb000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b83c2fc8000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b83c31e4000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003fe7600000)
        libgfortran.so.1 => /usr/lib64/libgfortran.so.1 (0x00002b83c353c000)

(there is no 'libfunky.so.2' dependency in R yet; it's inserted above as an example)
----------------------------------------------------------------------------

and each of them typically has more, so it's fairly common for an update to break such dependency chains, if only due to a few missing or changed functions. If you run into a problem that seems to be related to this, such as:

----------------------------------------------------------------------------
unable to load shared object '/apps/R/2.14.0/lib64/R/modules/libfunky.so.2':/
libfrenemy.so.3: cannot open shared object file: No such file or directory
----------------------------------------------------------------------------

The above extract implies that the library 'libfunky.so.2' can't find 'libfrenemy.so.3' to resolve missing functions, so that lib may be missing on the node that emitted the error. If this error is emitted from a node during a batch job, it may be hard to determine which nodes are in error. To resolve this by yourself, it's sometimes useful to use http://moo.nac.uci.edu/~hjm/clusterfork/[clusterfork] to debug the problem. In the above case, you would issue a command such as:

----------------------------------------------------------------------------
cf --target=PERC 'module load R/2.14.0; \
ldd /apps/R/2.14.0/lib64/R/modules/libfunky.so.2 |grep found'
----------------------------------------------------------------------------

where 'libfunky.so.2' is the library in question. The results will capture the STDERR and STDOUT from the single-quoted command in node-named files in a subdir that begins with 'REMOTE_CMD-' in the working directory. Examining those files will usually identify the offending nodes.

*Please be careful in using 'cf', since you can easily overwhelm the cluster if the command demands a lot of CPU or disk activity*. Try the command on one node first to determine the effect, and only issue the 'cf' command after you've perfected it.

Release information & Latest version
------------------------------------

The latest version of this document should always be available http://moo.nac.uci.edu/\~hjm/bduc/BDUC_USER_HOWTO.html[here]. The http://www.methods.co.nz/asciidoc/[asciidoc] source is available http://moo.nac.uci.edu/\~hjm/bduc/BDUC_USER_HOWTO.txt[here].

This document is released under the http://www.gnu.org/licenses/fdl.txt[GNU Free Documentation License].